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Abstract 

In this article, we consider a simple representation for real numbers 
and propose top-down procedures to approximate various algebraic and 
transcendental operations with arbitrary precision. Detailed algorithms 
and proofs are provided to guarantee the correctness of the approxima¬ 
tions. Moreover, we develop and apply a perturbation analysis method 
to show that our approximation procedures only recompute expressions 
when unavoidable. 

In the last decade, various theories have been developed and imple¬ 
mented to realize real computations with arbitrary precision. Proof of 
correctness for existing approaches typically consider basic algebraic op¬ 
erations, whereas detailed arguments about transcendental operations are 
not available. Another important observation is that in each approach 
some expressions might require iterative computations to guarantee the 
desired precision. However, no formal reasoning is provided to prove that 
such iterative calculations are essential in the approximation procedures. 
In our approximations of real functions, we explicitly relate the precision 
of the inputs to the guaranteed precision of the output, provide full proofs 
and a precise analysis of the necessity of iterations. 


1 Introduction 

Various scientific disciplines use computations involving real numbers to model 
and reason about different phenomena in the world. Real numbers are typi¬ 
cally approximated by floating point numbers in scientific calculations. Round¬ 
off errors are inevitable in such approximations and they might build up into 
catastrophic errors in some cases. Exact real arithmetic approaches address 
this issue by devising computation procedures that given an expression and a 
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precision requested by the user produce an output that is guaranteed to meet 
the precision requirement. 

Several approaches to exact real arithmetic are based on iterative 

bottom-up calculations. Given an expression and a desired precision, bottom-up 
approaches typically start with calculating the inputs with an arbitrary precision 
higher than the requested precision. Then, the sub-expressions are evaluated in 
a bottom-up way. After evaluating the sub-expressions of each level, the guar¬ 
anteed precision is passed to the higher level. These calculations proceed until 
the main expression is calculated and its guaranteed precision is determined. 
If the precision obtained for the expression is not adequate, the computation 
restarts with increased precisions for the inputs. 

In contrast, various authors 0115] have proposed top-down approaches to 
exact real arithmetic. In top-down approaches, the required precision of each 
sub-expression is determined based on the precision required for its immediate 
parent expression. For certain types of expressions, the required precision of the 
sub-expressions can be calculated immediately. However, some expressions may 
require to first obtain additional information about the magnitude of the values 
of their sub-expressions before determining their required precision. Thus, in 
general, it might be necessary to recompute certain expressions. 

The main benefit of top-down approaches is that they exploit the structure 
of a given expression to estimate the required precision of the sub-expressions. 
In this context, one would ideally like to have top-down approximations for alge¬ 
braic and transcendental functions such that 1) the approximations are proven 
to be correct and 2) iterative calculations are avoided unless they are proven to 
be necessary. 

In several studies 0E], proofs of correctness for algebraic operations are 
available. However, the arguments about transcendental functions provide lit¬ 
tle insight about the correctness of the approximations and the effect of these 
operations on precision. Taylor expansions are the most prominent way to ap¬ 
proximate transcendental functions. Calculations with Taylor expansions are 
typically restricted to a base interval; range reduction identities are used to 
extend the computations to the complete domain of a function. Proofs of cor¬ 
rectness for transcendental functions are limited to the base interval [Misiinj, 
whereas little attention is given to the general case where the computations 
introduced by range reduction identities influence the output precision. 

The second desired property for a top-down approach is related to the itera¬ 
tive nature of the computations. As discussed above, bottom-up and top-down 
approaches rely on iterative computation schemes. However, no formal rea¬ 
soning is provided to prove that such iterative calculations are essential in the 
approximation procedures. 

In this article, we consider a simple representation for real numbers and 
propose a top-down approach to approximate various algebraic and transcen¬ 
dental functions with arbitrary precision. For each operation, we describe an 
approximation procedure and relate the precision of the inputs to the guaran¬ 
teed precision for the output. To guarantee the correctness of our top-down 
approach, we provide detailed proofs of correctness for the proposed approxi- 
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mations. 

To identify computational problems that require iterative calculations in our 
top-down approach, we have developed a perturbation analysis method. Our 
analysis describes the influence of errors in the inputs of a computational prob¬ 
lem on the output precision. We apply perturbation analysis to show that our 
approximation procedures only recompute expressions when this is unavoidable. 

Overview We discuss different approaches to defining computability of func¬ 
tions in Section In Section |3] we introduce a representation for real numbers 
and specify the syntax of the expressions that we consider in our computa¬ 
tions. To analyze computational problems in this setting, a perturbation anal¬ 
ysis method is introduced in Section 0] In Section [S] we discuss our approxima¬ 
tions of algebraic operations. We approximate transcendental functions using 
Riemann sums and Taylor expansions in Section [6] and [7l respectively. Section[8] 
contains discussions about related work. In Section IH] we draw some conclusions 
and suggest directions for future research. 

2 Computable Real Functions 

Real arithmetic is concerned with performing computations on real numbers. 
In order to do calculations with real numbers, it is necessary to define what 
it means for an operation to be computable. In this section we briefly discuss 
different approaches to defining computability. 

Since real numbers are infinite objects, one can use infinite streams from 
a finite alphabet E to represent them. This gives rise to a definition of com¬ 
putability called Type-2 Theory of Effectivity (TTE, [21]). In TTE computable 
operations are defined in terms of functions / : that receive infinite 

words as input and produce infinite words as output. An essential property 
of a Type-2 computable function is the finiteness property m- This property 
indicates that for a computable function /, any finite prefix of the output f{x) 
is computable by some finite portion of the input x. 

An alternative definition of computability has been introduced by the Rus¬ 
sian school of constructive analysis mui]. In this definition, computable op¬ 
erations are defined in terms of functions / : E* —>■ E* that receive finite words 
as input and produce finite words as output. This approach is sometimes re¬ 
ferred to as Type-1 computability. A function / is Type-1 computable if there 
is a Turing machine that transforms any finite input string a; G E* to the finite 
output strings f{x) € E*. Type-1 machines provide a natural way to define 
computability on, for instance, rational numbers and finite graphs. 

In both Type-1 and Type-2 approaches, E depends on the concrete represen¬ 
tation that we use for input/output objects. For instance, one can use the binary 
signed-digit representation to represent real numbers in the inputs/outputs of 
computations. 

The relation between Type-1 and Type-2 computable functions has been in¬ 
vestigated in various studies. It is known that restricting the domain of a Type-2 
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computable function to finite streams results in a Type-1 computable function 
HHl]. However, not every Type-1 function can be obtained by restricting some 
Type-2 computable function [Bin]- To illustrate this, we consider a function 
/i : {0,1}* —>■ {0,1}* defined as follows: 



where k G N is the length of the longest prefix of zeros in s and s' G S* is 
a finite suffix of s in the second case. The function fi performs computations 
on hnite strings and one can construct a Type-1 Turing machine to compute 
this function. By extending the domain of fi to infinite strings, we obtain 
/2 : —>• such that: 



The function /2 is not computable with a Type-2 Turing machine; it is not 
possible to write 0 in the output after reading a finite prefix from the input. 
The interested reader can refer to [1] for more details about the relation between 
Type-1 and Type-2 computable functions. 

In addition to Type-1 and Type-2, one can also consider a third approach 
to dehning computability based on certain finite structures that provide precise 
descriptions for specific classes of real numbers. For instance, Lagrange’s the¬ 
orem on continued fractions indicates that the real numbers whose continued 
fraction is periodic are the quadratic irrationals. Based on this observation, one 
can define computability in terms of functions / that given a finite and precise 
representation of x produce a finite and precise representation of f{x). 

The exact real arithmetic approach that we introduce in this article is based 
on Type-2 computability. In Section [3] we discuss a representation for real 
numbers in terms of rational numbers that are coupled with a notion of precision. 
Our approximations for arithmetic operations rely on the finiteness property 
of computable functions. Thus, for a given computational problem, a desired 
precision for the output is obtained based on sufficiently good approximations 
of the inputs. 

3 Real Numbers: Representation &: Operations 

In this section we first discuss our representation of real numbers and then 
describe the syntax of the expressions that we aim to calculate in our setting. 

Since real numbers are infinite objects, a finite representation of an arbi¬ 
trary real number x can only approximate x with a certain precision. In sci¬ 
entific measurements and calculations, the amount of error that we commit in 
approximations is measured by an absolute or relative error. In practice, an 
absolute error is of little use. Since numbers tend to have very different orders 
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of magnitude, it is the relative error that shows the significance of the lost digits 
in measurements or calculations. Hence, in our setting we use a representation 
based on the relative error. 


Definition 1. A real number x is represented by a tuple {m,n,p) such that: 


, m, ,m, 1 
a;- - < - 
n n 2P 


where m,n G’Z \ {0},p G N. 

The representation (m,n,p) for x means that ^ approximates x and the 
relative error of this approximation does not exceed 

In this article, we focus on calculating expressions that can be described 
with the following grammar: 

E ::=c| -E\E-E\ 1 \ E + E \ Ve \ \ 

ln(£') I arctan(£') | cos{E) \ sin{E) 


where c represents a rational constant. 

The following identities show that other interesting operations can be de¬ 
scribed in terms of the operations of this grammar: 


tan(a;) 

cot(x) 

arcsin(a:) 

arccos(x) 

arccot(a;) 


sin(a;) 

cos(x) 


1 


tan(a:) 

arctan( 

arctan( 

arccos(- 


x 

\/l — x"^ 
'Jl — x"^ 

X 

X 

\/T+^ 


) 

) 

) 


4 Sensitivity of Operations to Perturbations in 
the Arguments 

Our goal is to develop a top-down exact real arithmetic approach based on the 
representation of Definition [T] Thus, for a given computational problem, it is 
essential to estimate the required precision of the inputs based on the desired 
precision in the output. Moreover, we would like to investigate to what extend 
the operations of grammar © can be calculated in a top-down manner without 
iterations. 

To analyze the operations of grammar o, we introduce a pertubation anal¬ 
ysis method for measuring the sensitivity of the operations to perturbations 
in their arguments. We consider two general cases in our analysis. First, we 
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consider a function f{x) in one variable and show how errors in the input in¬ 
fluence the output fSection 14.11) . Then, we consider a function f{x,y) with 
two arguments and investigate the effect of errors in the inputs on the output 
fSection I4.2|) . 

4.1 Perturbation Analysis for Unary Functions 

Let f{x) be a differentiable function that we want to calculate in point x = a. 
Suppose that Aa is a perturbation in the argument a. The relative error in the 
calculation of /(a) caused by this perturbation is: 

f{a + Aa)-f{a) 

' /(«) ' 

We want to relate this relative error to the relative error of the argument, namely 
I — I . To this end, we use the following approximation of the function / in point 
X = a + Aa: 


f{a -f Aa) Ki f(a) + f'{a)Aa 

We can approximate the relative error of / as follows: 

, /(g -f Aa) - /(g) , _ , /'(a) Aa , _ , a/'(a) ,, Aa, 

' /(«) /(«) ' ' /(«) " a ' 

From equality ([5]) one can see that the quantity | | determines the effect of 

the relative error j-^l on the output. In numerical analysis and linear algebra 
the quantity \ is usually referred to as the condition number oi f{x) [SJ[B]- 

4.2 Perturbation Analysis for Binary Fnnctions 

Let f{x,y) be a differentiable function that we want to calculate in point 
{x,y) = {a,b). Suppose Aa and Ab are perturbations in the arguments a and 
b, respectively. The relative error of /(a, b) caused by these perturbations can 
be calculated as follows: 

J{a +Aa,b +Ab) - f{a,b) ^ 

77 To 

f{a,b) 

To find an upper-bound for ©, we use the following first-order approximation 
of the function / in point {x, y) = (a + Aa, b + Ab): 

f{a + Aa, b + Ab) r; f{a, b) + U(a, b)Aa + fy{a, b)Ab 

where fx{o-, b) and fy{a, b) are the partial derivatives of / with respect to x and 
y in {a,b), respectively. We relate the relative error in the calculation of / to 
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the relative errors j-^l and j-^l as follows: 


f{a + Ag, b + Ab)- f{a, h) 
f{a,b) 


f^{a,b)Aa + fy{a,b)Ab 


< 


fia,b) 

|a/x(a, &) II Aa, ^ ^bfy{a,b) ^^Ab 

b ' 


f{a,b) 


f{a,b) 


< 


(I 


afxia,b) bfy{a,b) 


Aa, ,Ab 


ft h\ ' ' ' f! l)™ax{| —1,1^1} 
J{a,b) j{a,b) a b 


( 4 ) 


The quantity | I 4" I I determines the upper-bound calculated in 

inequality (|4]) and we use this quantity to measure the effect of erroneous ar¬ 
guments on the output. It should be noted that in inequality o we have 
considered | ^ | and | ^ | as independent factors that can influence the relative 
error of /. This way of reasoning about the sensitivity of f{x,y) is related to 
componentwise analysis of perturbation in numerical analysis and linear algebra 
[7], which we use in this article. 

Another possibility is to relate the relative error of ([3|) to the quantity: 



Aa 



Ab 



a 



[5J 



This type of analysis is usually referred to as normwise analysis of perturba¬ 
tion [7]. 

In the following sections, we provide a top-down approach for approximating 
various algebraic and transcendental functions. Perturbation analysis will be 
used to show that in our approximations we only recompute expressions when 
essential. 


5 Approximating Algebraic Operations 

In this section we calculate the algebraic operations of grammar o using a top- 
down approach. We formulate and prove theorems that allow us to calculate ex¬ 
pressions involving unary negation ('Section l5.ll) . multiplication ('Section l5.2l) . in¬ 
verse (Section [531), addition fSection [5^ . and square root fSection (531) . Based 
on the theorems, we provide different implementations of COMPUTE(expr, p) to 
calculate algebraic operations. These implementations receive an algebraic ex¬ 
pression expr and a desired precision p and produce an output with the desired 
precision. In each case, we also apply the perturbation analysis of Section|3]and 
show that we avoid unnecessary iterations in our approximations. 

5.1 Unary Negation 

Theorem 5.1. Let x be a real number represented by (rn,n,p). Then —x can 
be represented by {—m,n,p). 
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Proof. Since x is represented by (m,n,p) we can write: 


m 

n 

m 

n 


, m, 1 m ,m, 1 

'ir'i; ^^ TT + 

, TO , 1 TO , TO , 1 

“ ^ <-a;< — + - 

n 2P n n 2P 


Thus, we can represent —x by {—m,n,p). 


□ 


Algorithm [T] applies Theorem IS.ll and approximates —x based on a represen¬ 
tation {m,n^p) of X. To confirm that —x can be approximated with arbitrary 
precision in one pass, we calculate | \ for the function f{x) = —x: 

, xf'{x) ^ , (x)(-l) ^ 
f{x) -X 

The quantity | \ is small and independent of the argument x and hence the 

amount of precision that we lose in unary negation (which is 0 in the case of 
Theorem I5.ip can be calculated independently of x. 


Algorithm 1 Unary Negation 

Require: expr has the shape —x 
1; procedure COMPUTE(ea:pr,p) 
2: ^ ^ COMPUTE(x,p) 

3: return 

n 


5.2 Multiplication x ■ y 

Theorem 5.2. Let x and y be two real numbers represented by {m,n,p) and 
(to', n' ,p), respectively. Then x ■ y can he represented by (mm', nn',p — 2). 

Proof. From Definition [T] we can write: 


TO 

n 


TO 


/ 


n' 


, m, 1 m ,m, 1 

- <a:< -+ 

n 2P n n 2P 

m' . 1 m! I 1 

„/ I 2P < I 12P 


(5) 

( 6 ) 


We consider three cases: 

1. Suppose > 0. We can multiply inequalities ([S|) and (O as follows: 


mm! _ 1 , n mm! _ 1 











li X ■ y can be represented by (rnm\nn',p — 2) then it must be the case 
that: 


mm!, 1 . mm', 1 , 


2p- 

To see that this is valid, we need to show that: 

mm' J_ 2 ^ tW _J_ 

nn' ^ 2P’ - nn' ^ 2p- 
mm', 1 , mm', ^ 1 


r(i - 2?=2> S —(1 - 


2P' 


But > 0 and from Proposition [T] (see we know that both inequal¬ 

ities hold. 


TO 1 , TO 1 , 

- 1 - < 2^ < - 1 + 
n 2P n 2P 

to' _ 1 , to' , 1 , 

“(1 + 7^) < y < ~ 

n 2P n 2P 


2. Suppose ^ > 0, ^ < 0. We can rewrite inequalities ([5]) and ([H]) as follows: 

(7) 

( 8 ) 

Multiplying inequalities dZD and dS]) we get: 

If a; • y is representable by {mm', nn' ,p — 2) then it must be the case that: 


, 1 , mm' 1 , 

r(l + 9„_2 ) < ^ ' y < , ( “ 9 P- 2 ) 


nn’ ' 2P“2 nn' 2 p 

To show that this is valid, it suffices to prove that: 

mm! 1 \2 ^ mm! 1 

T1 “ ^1 — V “ 9P- 


nn 

mm '1 , ^ , 

—r(i + - —r(i + w) 

nn 2P ^ nn 2 p 


nn’ ' 2P~"^' 

mm '1 


But < 0 and from Proposition [T] (see we know that both inequal¬ 
ities hold. 

3. Suppose ^ < 0, ^ > 0. This case can be proved similarly to the second 
case. 


□ 

Algorithm [2] depicts an approximation of a; • y based on Theorem 15.21 Given 
approximations of x and y, this algorithm approximates x ■ y ra. one pass; the 
loss of precision is predictable without additional information about x and y. 
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Algorithm 2 Multiplication 

Require: expr has the shape x ■ y 
1: procedure COMPUTE(ea:pr,p) 
2: ^ ^ COMPUTE(x,P + 2) 

3: ^ ^ COMPUTE(j/,P + 2) 

4: return ^ 

7 Tn' 


To confirm this claim, we apply the perturbation analysis of Section |4] on the 
function f{x, y) = x ■ y: 

j{x,y) j{x,y) x.y x.y 

The sensitivity measure is small and independent of the arguments. Thus, in 
a top-down approach, x ■ y can be approximated in one pass without iterative 
computations of x and y. 


5.3 Inverse 


Theorem 5.3. Let x be a real number represented by {m,n,p). Then - can be 
represented by (n,m,p— 1). 

Proof. We consider two cases: 

1. Suppose ^ > 0. Since x is represented by {m,n,p) we can write: 

m 1 , m 1 , 

n 2P n 2P 

n 2P I ^ n 2P 
m2P + 1^ ^ m2P - 1 

If - is representable by (n, m,p — 1), then it must be the case that: 


1 ^ 1 lx 

- (1 ~ ^„-l ) < - < - (1 + r,„_l ) 

m /P -t nr m 


n 

(1 - 

m 2P~‘-' x m 
To see that this is valid, it suffices to show: 


n,2P n 1 , 

m 2P — 1 m 2P ^ 

rrP 2P-1 ' “ to^2p -b 1 

But ^ > 0 and hence both inequalities follow from Proposition El fsee El) . 

2. Suppose ^ < 0. Since x is representable by {m,n,p) we have: 

TO _ 1 , TO 1 , 

n 2P n 2P 

n 2P 1 n 2P 

m 2P — I ^ X ^ m 2P + 1 
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If i is representable by {n,m,p— 1) then it must be the case that: 


-(1 


2P 


1^1 n 1 

tt) < - < -(1- 


2P- 


To see that this is valid, we need to show: 


2P 


(-) < —(1-r 

m^2P + 1' ~ m 2P-1 


2P 


n 1 1 ^ ^ n , 

rrV ^ “ to^2p - 1 


) 

) 


But ^ < 0 and hence the inequalities follow from Proposition [5] (see |X|. 

□ 


Algorithm 3 Inverse 

Require: expr has the shape ^ 

1: procedure COMPUTE(ej:pr,p) 
2: ^ ^ COMPUTE(x,P + 1) 

3: return — 


Algorithm [3] approximates ^ with precision p based on Theorem 15.31 Given 
an approximation of x with precision p, the algorithm allows us to approximate 
i in one pass. We use perturbation analysis and calculate the quantity 
for f{x) = i to show that loss of precision in the inverse can be estimated 
independently of the argument: 

l^/^l ^ iMiMll = 1 
' m ' ' (i) ' 

The quantity is a constant and hence iterative computations can be 

avoided when calculating the inverse. 

5.4 Addition 

Theorem 5.4. Let x and y be two real numbers represented by (rn,n,p) and 
{m',n',p), respectively. The value of x + y can be approximated as follows: 

i. If > 0, then x + y can be represented by (mn' + m'n,nn',p). 

ii. If HH" <; 0 and i G N+ is the smallest natural number such that i > 

ii- 

log 2 (— m I ^ 1 ) ); thenx+y can be represented by [mn'-\-m'n,nn',p—i). 
““(I ^ I’l ^ I) 
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Proof. 


i. For numbers x and y we can write: 


m 

m 

, 1 




m 


m 

1 

(9) 

— - 

- — 


< 

X 

< 

— 

+ 1 

— 

— 

n 

n 

1 2P 




n 


n 


m' 

1 m' 

|1 

< 


< 

w! 

+ 

. m' 


(10) 

n' 

n' 

' 2 P 

y 

n' 

177 

■ 1 - 

'' 2 P 


From inequalities © and (fTUl) we can write: 


,m m\ 1 ,,m, ,m m\ 1 ,m', 

<Tr + - 2J<i7ri + irfi) <- + !'< (tt + 77> + +1^1 


(11) 

If a; + y is representable by (mn' + m'n,nn',p), then it must be the case 
that: 


,m m\ 1 , TO to', 

(- + —)- + —I <a(- 

n n 2^ n n 


,m m\ 1 ,m m' 

■y<{- + —) + 7^\- + —\ 

n n zr n n 


To show that this is valid, we need to prove that: 


, TO to' . 1 

(-^— 7 ) + VI 

n n 2^ n 


m, ,ra ,m m \ 

/VO' n nrt' 


'' n 


m 

n' 


1 . m 
^ n 


m 

n' 


'' n 


m 

n' 


1 

TO to' , 

( 12 ) 


- + — 1 
n n' 

TO 

, , Tfl' ,, 

( 13 ) 

( 1 - 

n 

i + i— 1 ) 

n 


The rational numbers — and ^ have the same sim. Therefore, I — + ^| = 
1^1 + 1^1 holds and inequalities (HU and m are valid. 

ii. If a: + y is representable by {mn' + m'n, nn',p — i), then it must be the case 
that: 


.TO to' , 1 , TO to' , 

(- + —)-bli^|- + —I <2^- 

n n 2^ ^ n n 


,m m\ 1 ,m m' 

■y<i- + —) + T^\- + —\ 

n n 2^ ^ n n 


To show that this holds, we need to prove the following (see inequality (ITT|) 1: 


.TO to' , 1 ., TO, , to' ,, , TO to' , 1 , TO 


TO to' 1 TO to' To to' 1 To 

n n' 2P“* n n' “ n n' 2p n 


n' 

rvn' 


For both inequalities, it boils down to proving the following: 


1 


n 


/•' T) Yl' 


n' 


(14) 


To prove inequality (I14L we consider the following two cases: 
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1. Suppose ^ > 0 and ^ < 0. We can rewrite inequality (HI as follows: 


, m m', 1 ,m m!. ,m m\ 1 ,m m'. 

- + —I > -r) ^(- + —) > -r)v 

n n' 2^ n n n n 2® n n 

,m m' 1 , m' m 

n n' “ 2* n' n 


In other words, we should show or Depending 

on the values of I —I and 1^1, both cases follow from ^ > 

I n I I n' l> min(|^|,|^|) “ 

min(|^|,|^|) 

Equivalently, we should have i > log 2 (—)• 

2. Suppose ^ < 0 and ^ > 0. We can rewrite inequality (ITU) as follows: 


, m m' 1 ,m' m, , m m', 1 ,m' to , 

■ n + “ ^^77 “ n 

,TO to' 1,TO to'. 

n n' “ 2® n n' 


In other word, we need to prove or Depending 

on the values of I —I and 1^1, both cases follow from > 

In' I n' I’ iiiin(|^|,|^|) “ 

min(|^|,|^|) 

Equivalently, we should have i > log 2 (— T",') )■ 

□ 

Algorithm |4] applies Theorem 15.41 to approximate x + y with precision p. 
If approximations ^ and ^ have the same sign, we do not lose precision by 
calculating x + y. On the other hand, if ^ and ^ have different signs, the 

amount of precision that is lost depends on the magnitude of ■ 

This indicates that if < 0 and | — | Ri | ^ |, a significant amount of precision 
can be lost in x + y. Thus, if the guaranteed precision for x + y (i.e., p — i) is not 
sufficient, x and y must be recomputed with higher precisions (see Line l3II3] in 
Algorithm |4]) . 

To confirm this observation, we apply the perturbation analysis of Section [4] 
on f{x,y) =x + y: 

, xMx,y) yfy{x,y) ^ , x y 

f{x,y) f{x,y) x + y' x + y' 
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Algorithm 4 Addition 


Require: exyr has the shape x -I- y 


1 

procedure COMPUTE(expr,p) 


2 

dp ^ p 


3 

repeat 


4 

^ ^ COMPUTE(x,dp) 


5 

^ ^ COMPUTE(y, dp) 


6 

if W > 0 then 

nn' 

\> Theorem |5.4|i| 

7 

return 

nn' 

> Theorem 15.4liil 

8 

9 

else 

1 ^ riog2( -^— ^7 — )1 

10 

11 

if dp — i > p then 
return 

nn' 


12 

else 


13 

dp ^ dp -I- 1 


14 

until true 



The quantity + ly^l is 1 when xy > 0. Thus, we can estimate the 
loss of precision in x + y independently of the arguments when xy > 0. 

However, the quantity ly^l + ly^l can be arbitrarily large when xy < 0 
and |a; +j/| ~ 0. This confirms that a significant amount of precision can be lost 
in a: + y. 

Perturbation analysis shows that in general, we cannot estimate the amount 
of precision that is lost in a; + y independently of the arguments. Hence, if 
approximation ^ and ^ have different signs, recomputing x and y might be 
essential to obtain the desired precision for a; + y. Loss of precision in a; + y is 
sometimes referred to as loss of significance |1] or catastrophic cancellation [5]. 

It should be noted that Theorem 15.41 does not imply that a; + y is always 
fundamentally problematic when xy < 0 and |x + y| « 0. In certain cases, the 
calculation can be adjusted in such a way that loss of significance can be avoided 
and the expression can be calculated in one pass. 

Suppose we want to calculate \/x + \ — ^fx for a relatively large x. Since 
•v/x + 1 ~ \/x, we will lose a significant amount of precision if we directly cal¬ 
culate -v/aH-T — -y/x. However, we can change the calculation algorithm by 
rewriting the expression as follows: 


■v/xTT — \/x = (\/x -I- 1 — '/x) X 


■v/xTT -I- sjx 
\/x -t-1 -t- \fx 


1 

\/x -t-1 -t- \fx 


In the new expression, all the operations can be approximated with a desired 
precision in one pass (see Section [53] on calculating square root). Hence, we can 
approximate the new expression without recomputing the sub-expressions with 
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higher precisions. Applying the perturbation analysis of Section S] also shows 
that f{x) = — y/x is not fundamentally problematic: 

: Xf'ix) _ , ^(2Vx+l ~ I _ 1 I X 1 

' f(x) ' ' yJTT-yi ' 2]/x+l 2 

Using perturbation analysis, we can identify instances of x + y where adjust¬ 
ments in the algorithm can avoid loss of significance. However, to our knowledge 
a general scheme for making such adjustments does not exist. 

5.5 Square Root 

In this section, we calculate i/x by approximating the root of /(y) = y^ — x 
using the Newton-Raphson method [9]. The Newton-Raphson method starts 
with an initial approximation yo for i/x and iteratively generates a sequence of 
approximations. Assuming that the precise value of x is available, the sequence 
of approximations is generated by: 


yn+i — 


yl + x 

2'yn 


In what follows, we prove a theorem for approximating ^/x by the Newton- 
Raphson method when an approximation {m,n,p) of x is available. 

Theorem 5.5. Let x be a real number represented by {m,n^p) such that: 


m 


= O.bi ... X 2“ 


where bi € {0,1} for 1 < i < k and &i = I,a G Z. Then y/x can be represented 
by {m',n',p — 4iV) where N = |’log2(log2(2P+^ -|- 1))] -|- I and ^ is the N-th 
term of the following sequence: 


Vn+l — 




2j/r! 


yo 


Proof. From Definition [T] we write: 


I , TO, I , 

— ~ < a; < —(1 -I- —) 

n 2P n 2P 

To prove the theorem, it suffices to show: 

\Vx-yN\ < 

We rewrite the left hand side of inequality m- 


(15) 


iVx-yNl = Wx- ZN + Zn - ywl < Wx- Zn\ + \ZN - yN\ (16) 
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In inequality (fTBl) . zat is the iV-th term of the following sequence: 


= 4±£,.. = 2rfi« 


To prove inequality (II3, it suffices to show that the following inequalities hold: 


Wx- Zn\ < ^p_4Af+i \yN\ 


\zn - yw| < 


1 


-IvnI 


2 p—4A^+1 

Proof for inequality (flTl) : First, we show that j/„ > for all n € N: 


( 17 ) 

( 18 ) 


_ lx yl., + f yl-i + ^ jm 1 + f 

'^2 2y„_i \2 2y„_i V2n^ 2^^“ 2j/„_i 


yl., + ^ - 2y„_i 

2yn—1 


iVn-1 - 

2yn—1 


> 0 


Observe that as yn > \f\-, inequality (13 is valid, if the following inequality 
holds: 


Iv/ir - ZAr| < 


1 


2P-4Ar+| 

To prove inequality (fT^ . we find N such that: 


\fx 


Zn — \fx = 


1 


2P+2 


\/x 


( 19 ) 


( 20 ) 


We calculate the quantity 

^ ZN—yJx 


_ „ — 

yjx _ 2ziv_i _ ^:v-i + X + 2Zjv_iVx 

ztq — \fx ^N- 1 +^ rr z\X - 2z]S! 

22 ^^_^ V 


22iV-l 


, ZAT-I + \fx , 


_^ —1 I _^^U”rv*^ ^2^ 


,Z0 + \fx 


( 21 ) 


'ZAT-l-y^ Zo-y/x 

Suppose equality dini) holds for N. We can rewrite the right hand side of 
equality m as follows: 

^ ^ I A / ;/; . „ 

1 ( 22 ) 


^ ZQ + y/x ^ 2 ’'^ _ ZN + y/x _ (2 + y+j)y/x _ 

^Zo-y/x ZN - y/x {^^)y/x 


The index N that guarantees the required precision of equality (Uni) can be 
calculated from equality (1^ : 


2^l0g2(^^±^)=l0g2(2P+3 + l) 

Zo - y/x 

N = log2(log2(2P+3 + 1 )) - log2(log2( ^° ~^ ^ )) 

Zq - y/x 


( 23 ) 
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To guarantee that N is well-defined, we show that zq > x. To prove the in¬ 
equality, we use the assumptions ^ = O.&i... 6fc x 2“, G {0,1} for i = 1,..., /c 
and bo = l,zo = 


'/x < \ —(1 + ^) < VO-bi ■ --bk x 22 x \/ 2<22 xv ^< 

V n 2P 


zo 


To obtain an estimation for N from equality (l2^ . we calculate an upper bound 

2rt1+i 2^+^ , 2y/2 


for 

\/x 


Zq 

\fx 


< 


< 


VO.fo ... X 2t X 


< 




X 


4 

?2 


In the worst case, the initial approximation zq differs from i/x by a factor 
We use this estimation in equality (1^51) to calculate the number of iterations for 
the Newton-Raphson method: 

4= + 1 

N = l0g2(l0g2(2^'+3 + 1))- log2(log2(^i—)) 

<riog2(log2(2^’+3 + l))l +1 (24) 

Proof for inequality (fTSl) : To prove inequality dUD, we consider the calcula- 

I X 

tions in zat = n ~^ — and estimate the amount of error that we commit in the 

Zzjv-l 

■ Pw-l + W 

approximation yjv = 2 yN.i ■ 

Let P(k) denote the amount of precision that we lose when we approximate 
Zk by Uk- Thus, we lose P{N — 1) units of precision if we approximate ^at-i by 

UN-l- 

1 


\ZN-1 - VN-ll < 


2p—P(iv—1) 


lUN-l 


The precision is reduced by 2 units when z%_i is approximated by y%_i (see 
Theorem 15.21) : 


\z^N-l Vn-iI < 


1 




2p-PiN-l)-2 I 

We approximate -f a: by y%_i + We do not lose precision in this 

approximation (see Theorem 15 .dlil) : 

\{z%_i -fa;) - {y%_i + —)\ < 2p_p(7v-i)-2 + “I 

Given the approximation j/w-i of zat-i, one unit of precision is lost in the 
approximation of ^ (see Theorem 15.31) : 

,1 1 , 1 , 1 , 


< 


ZN-l yN-1 2P b 1 l/Af-l 

1 1 , 1,1 

< 


2zn-i 2yM-i 2P ^ 2?/Ar-i 


(26) 
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Finally, we approximate 


1 +x 


2zn- 


based on the approximations described in in¬ 


equality (ESI) and (1^ (see Theorem 15.21) : 


Izn - ml = 




Pn-1 


< 


1 


iVn-i+ 


2z]v-i 2 ?/ 7 v-i ' 2P 2yM-i 

From inequality (1271) we obtain the following recursive formula: 

P{N) = P{N - 1) + A 


(27) 


Since po = Zo = 2rtl+i, we lose P(IV) = P(0) + AN = 4iV units of precision 
in our approximation of zat. We apply the number of iterations calculated in 
inequality (|24ll and obtain: 

P{N) =AN< 4riog2(log2(2P+3 + 1))] + 4 


□ 


Algorithm 5 Square Root 


Require: expr has the shape ^/x 
1: procedure COMPUTE(ea:pr,p) 

2: Choose px such that Px > P -f 4|’log2 (log2(2P‘*+^ -f 1))] -f 4 

3: A ^ |■log2 (log2(2P»+3 -h 1))] -f 1 

4: ^ ^ COMPUTE(x,Pa;) 

5: if < 0 then 

n 

6: “Undefined operation” 

7: else 

8: [> ^ can be represented as 0.6i... 6fc x 2“ 

9: > bi € {0,1} for I < i < k,bi = 1 and a G Z 

10 : a ^ Llog 2 (^)J + 1 

11: 1/ot-2rtl + l 

12: for z = 1 to A do 

, 3 , . .La- 


14: 

15: 


Vi A 

return — 


2Vi-i 


Algorithm [S] applies Theorem 15.51 to approximate ^/x with precision p. As 
indicated in Theorem 15.51 loss of precision in the square root can be estimated 
independently of the argument x and hence Algorithm [5] approximates ■\/x in 
one pass. We apply perturbation analysis on f{x) = ^Jx to show this: 

a/'( a;) I 

' /(x) ' “ ' ^ " 2 

The quantity is a small constant. Thus, in our top down approach, we 

can approximate ^/x without iterative computations. 


18 
















Figure 1: Approximating 


6 Approximating Transcendental Functions by 
Riemann Sums 

In this section we introduce approximations for (Section 16.11) . ln(a;) (Sec¬ 
tioning, and arctan(a:) ^Section 16.31) . We use Riemann sums to approximate 
these functions. 

The COMPUTE(expr,p) function introduced in Section [5] will be extended 
to allow the approximation of e“,ln(a:) and arctan(a;) with a given precision p. 
Perturbation analysis will also be used to identify computational problems in 
which iterative computations are unavoidable. 


6.1 Exponential Function 

To approximate the exponential function by Riemann sums, we first provide a 
simple approximation for where we assume that x is precise. Then we extend 
this calculation to approximate where x is represented by {m,n^p). 

Suppose X > Q. To calculate e“ we consider the curve y = e* and calculate 
the area enclosed by this curve and the t-axis between t = —x and t = x as 
follows (see Fig. 1(a) I: 



— e 


— X 


We use Riemann sums to approximate this area; Fig. |l(b)| shows two approx¬ 
imations from above and below using rectangles. Thus, we get the following 
inequalities for N rectangles: 


N-l 


E 


2x 

TV 




< 



i=l 


(28) 
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We rewrite inequality (051) as follows: 


N-l r, N 


2=0 




7\r/\2x J ^ ^ \ -i-j- J\ 2x ^ I 

Jy efr - I jy efr -1 

2x 2x 2x 2x 

— < e'^ — 1 < —e"^ 

N - ~ N 


(29) 


We assume that N > 2x and calculate an upper bound and a lower bound for 
from inequality (1291) : 


, 2x, N N 

1 + _)^ <e^ < (- 

^ - -^N-2x^ 


(30) 


We can estimate the precision of the approximations calculated in inequal¬ 
ity (1501) . For example, we can approximate by ( n- 2 x )^ absolute 

error of this approximation can be calculated as follows: 


N N N 

|(-)“ - e“| < |( 


K 2x,iV. 

- ) 2 - (1 -I-) 2 

N -2x’ ^ n’ ' 


--1 


= UtAtJ - (^)) E 


'N-2x' 


N 


i=0 


N 


_ Ax"^ ^ (' ^ 

N{N-2x) -2x’ ^ N ’ 

K-i 

4x^ X—^ ^ \ ^ 1 2x‘^ . N . N 

< - > (-)““ =-(-)“ (31) 

- N{N-2x) ^^N-2x’ N^N-2x’ ^ ' 

For the last inequality we apply ■ 

In the discussion above, we have treated a; as a precise value. In the following 
theorem, we extend this calculation and describe an approximation of that 
relies on a representation (m,n,p) of x. To simplify our approximations, we 
first assume that | —| < 1. Afterwards, we extend our approximations to an 
arbitrary (to, n, p). 


Theorem 6.1. Let x be a real number represented by {m,n,p) and |^| < 1. 
Suppose N is a natural number such that N > 2"^^. The value of can be 
approximated as follows: 

i. If 0 < ^ < 1 then can be represented by {m',n',p — 2|’log2(^)] — 3) 
where ^ . 


ii. 


//-I < ^ 
where ^ = 

n 


< 0 then 
1 


N 

TH 


-FT' 


can be represented by {m',n',p— 2 |’log 2 ('y)] 


4) 
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Proof. 


i. Suppose 0 < — < 1. Since x is represented by {m,n,p), we have: 


„ m , 1 , m,, 1 , 2m 

1 w ^ 1 w ^ < 2 

n 2P n 2P n 


To prove the theorem, it suffices to show that: 

|e - ( .r 2^. ) M < 


N N 


^_2m^ I 2P-2riog2(f )l-3_ 2m 
n ^ n 

We rewrite the left hand side of inequality (|33l) as follows: 

IX ! ^ \^\ \ X / ^ ^ ^ 

H = |e" - ^ ^ - ( 


(32) 


(33) 


'■ N -2SL- 


^N-2x' 


'_/V _ 2m 


, r / ^ iV N _N N 

\ ^ I 

< Ie" - 


'N-2x' 


'N-2x' 


"TV _ 2m ' 


To prove inequality (IMll . it suffices to show that the following inequalities 
hold: 


-( 


N 


N -2x 


I 

) H < 


1 


N jv 

) ^ I 


N N N N 

2m )n < 


2P-2nog2(f)l-2 '^N - 2™ 

n 

1 


N 


'N-2x 


_ 2rn/ I 2 P- 2 ri°g 2 (f )l-2 '^iV - — 




(34) 

(35) 


Proof for inequality (IMl) : From inequality (I5T]) . we obtain an upper- 
bound for the left hand side of inequality (I34|) : 


N s w 2x‘^ N .K 
|e - ( o ) H < ^(- 


'N-2x' 


N ^N-2x' 


(36) 


To calculate an upper bound for the right hand side of inequality (l36l) . we 
consider the function f{x) = n- 2 x ')^ calculate its derivative: 

,, , 4x N N_ 2/ ^ > 

f “ 'N^N-2x^ " ^ -2x^ " ^(7V- 2a;)2^ 

From inequality (1^ we obtain x G (0, 2). We choose: 

TV > 4 > 2a: (37) 

to ensure that f[x) is increasing for x G (0,2), i.e., f'{x) > 0. We rewrite 
inequality (155)) as follows: 


TV N 8 N N 


^N'^N -A' 


(38) 
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We calculate a lower bound for the right hand side of inequality (IMl) as 
follows: 


1 

2p-2nog2(-f )l-2 


AT _ 2m 1 " I > 2P-21oS2(f)-2 

n 


(39) 


Thus, to prove inequality (IM)) it suffices to show that the following inequal¬ 
ity holds (see inequality (1551) . (15^ 1: 

, 8 N ^ 1 

^ 2P-2i°g2(f)-2 

This is equivalent to the following: 

3-log2(fV)-by(log2(l-b-^^^)) < -p-f 21og2(y)-f 2 (40) 

We choose N > 8 and apply Proposition[3](seeEl) to obtain an upper bound 
for log 2 (l -b ^): 

^°S 2(1 + < (AT - 4) ln(2) ^ 

Based on inequality iQl),®!), it is sufficient to find an A^ > 8 satisfying: 

3 - log2(fV) + ( y)(^) <-P + 2 log2(y) + 2 (42) 

Inequality (1421) is equivalent to the following: 

, 16 
-log2(—) + ^^<-p-5 


From N > 8, we conclude < 4. Thus, we choose N such that N > 

/r,£±ii ON p P + ll 

max(2 3 ,8) = 2 3 . 

Proof for inequality ((35ll : To prove the inequality, we estimate the 
amount of precision that is lost when we approximate () ^ by ( y, \rn . ) 

The number x is represented by {m,n,p). Thus, we have: 


' n ' 2P' n ' 


Since N > 8, one unit of precision is lost when we approximate A^ — 2x by 
N — ^ (see Theorem l5.4lii|) : 


\{N-2x)-iN-—)\<^\N-—\ 
n 2P ^ n 
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Approximating j^^ 2 x n reduces the precision by one unit (see The¬ 
orem lOl) : 



1 

^ 1 

< 

1 

1 1 

N 

— 2x 

TV - 2™ ' 

to 

1 

to 

' TV - — 



n 



n 


N 

1 

< 

1 

1 ^ 

N 

— 2x 

2m 1 

2P-2 

' TV — — 


Finally, approximating reduces the precision by 2|’log2(Y)l units 

(see Lemma IT] in 1X1) : 

'^N - 2x’ " ~ ''TV- — "" ^ 2P-2r'°S2(f)l-2'''TV_ 2i!i’ " ' 

n n 


ii. Suppose ^ < 0. We use the following identity to calculate e^: 


We represent —x by (—m,n,p) (see Theorem 15.11) . Then, we apply the 
first part of the theorem and Theorem 15.31 to approximate e“^ and -4^, 
respectively. 

□ 


In what follows, we extend the approximations of Theorem 16.11 and calculate 
the exponential function for x represented by (rn,n,p) where |^| > 1. 

Theorem 6.2. Let x be a real number represented by (rn,n,p) and |^| > 1. 
Suppose k and N are natural numbers such that: 


'2’^n 


p+ii 

<l , N > 2 3 


The value of e® can be approximated as follows: 

i. // ^ > 0 then can be represented by {m',n',p — 2|’log2(^)] — 2fc — 3) 
where ^ 

ii. // ^ < 0 then can be represented by {m',n',p — 2|’log2(Y)l — 2fc — 4) 

itihpTP ^ i 

Proof. Since |^| > 1, we choose fc G N such that < 1. We use the following 
identity to calculate e“: 


e 


X 




(43) 


We approximate ^ by Since 2^ is a constant, we do not lose precision in 
this approximation. We apply Theorem 17.21 to approximate e^. This approxi¬ 
mation reduces the precision by: 
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• 2riog2(^)l + 3 units, if 0 < ^ < 1; 

. 2riog2(f )1 + 4 units, if -1 < ^ < 0. 

Suppose ^ is the approximation obtained for from Theorem 16.11 We 
approximate by ; we lose 2k units of precision in this calculation 

(see Lemma [T] in 

□ 


Algorithm 6 Exponential Function 

Require: expr has the shape e“ 


1 

procedure COMPUTE(ea;pr,p) 


2 

Choose N such that N > 2^^ 


3 

^p + 2[log2(f)l +4 


4 

repeat 


5 

f ^ COMPUTE(a:,Pa:) 


6 

if 0 < ^ < 1 then 

t> Theorem |6.1|i| 

7 

m' , { N \4- 

n’ ^ 


8 

return ^ 


9 

else if — 1 < ^ < 0 then 

> Theorem|6.l|ii| 

10 

m' j 1 


11 

return ^ 


12 

else 


13 

Choose A: G N such that < 1 


14 

if (f > 0)A 

t> Theorem 16. 2lil 

15 

(pa; - 2 |'log2(f-)l - 2fc - 3 > p) then 


16 

m’ , ( N 

n' ^ 1 N-‘^> 


17 

return ^ 


18 

else if (^ < 0)A 

> 'l'heorem|6.2|ii| 

19 

{px - 2|'log2(Y)l - 2A: - 4 > p) then 


20 

m' , 1 

n> ^ r N 


21 

return ^ 


22 

else 


23 

Px ^ Px + 1 


24 

until true 



Algorithm |6] implements the approximations described by Theorem 16.11 and 
[O to calculate e“ with arbitrary precision. Observe that when | —| < 1, 
can be approximated in one pass. However, when | —| > 1, loss of precision 
depends on the magnitude of |^|. Thus, recomputing x with higher precisions 
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Figure 2: Approximating ln(a;) 


might be necessary to compensate for the loss of precision caused by applying 
equality (1^ (see Line 141231 in Algorithm [6]) . 

To confirm that iterative computations are unavoidable, we apply perturba¬ 
tion analysis on f(x) = e^: 


xf'{x) 

fix) 



= X 


The quantity |a:;| can become arbitrarily large and hence approximating with 
precision p in one pass is not always possible. 


6.2 Natural Logarithm 

In this section, we first discuss an approximation for ln(a;) based on Riemann 
sums where we assume that x is precise. Then we extend this calculation to 
approximate ln(a;) where x is represented by (m, n,p). 

Suppose a; > 1 is a real number and we want to approximate ln(x). We 
consider the curve y = j (see Fig. 2(a) I and calculate the area enclosed by this 
curve and the t-axis between t = 1 and t = x. This area can be calculated as 
follows: 


dt 1 . X 
— = ln(a;) 

We use Riemann sums to approximate this area; Fig. |2(b)| shows how the area 
can be approximated from below and above using rectangles. Thus, we get the 
following inequalities for N rectangles: 



X — 1 
N 


N 


E 


1 

^ + jfix 


1 ) 


a: - 1 

< ln(x) < 


N 




1 + 


1 ) 
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This gives us an upper bound and a lower bound for ln(a;) and by increasing N 
we get more precise approximations. 

We can estimate the precision of our approximations. For instance, if we 
approximate ln(x) by 1 + » \x-i) absolute error can be estimated 

as follows: 


,x — \ 


N-l 




-ln(a;)| 


I X — i 

^ l + ^(x-l) 

1 /, _ i'l = (^~ 1)^ 
N ^ x’ Nx 


X — 1 
N 


E 




1 


1 )' 


(44) 


Up to this point, we have assumed that the precise value of x is available. 

In what follows, we formulate a theorem to describe an approximation of ln(x) 

based on a representation (m,n,p) of x. 

Theorem 6.3. Let x be a real number represented by {m,n,p) such that p > 1. 

i. If ^ > 1, then ln(a;) can be represented by {m',n',p — i — 4) where ^ = 

^ i_i_j_((*m)_i) ; N = |~2P~^ LrJi l > 0 ,'nd j is the smallest natural 

number such that j > log 2 ( ) holds. 

ii. // 0 < ^ < 1, then ln(x) can be represented by {m',n',p — j — 5) where 

^ ^ and j is the smallest 

natural number such that j > log 2 ( i_.^ ) holds. 

Proof. 


i. Suppose ^ > 1. The number x is represented by (m,n,p) and hence we 
can write: 


1 m TO 1 , TO1 , 3to 

To prove the theorem, we need to prove the following inequality: 

iV-l 


m _ ^ 


1 


N 


i=0 




< 


1 


- 1 


N-l 


2p-,_4I ^ El + ^(^_1) 


(45) 


(46) 
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We rewrite the left hand side of inequality (HSl) as follows: 


I Hx) 
I ln(a;) 

I ln(a;) 


m _ ^ 

n_ 

N 

a; — 1 
N 

a; — 1 
N 

a; — 1 
N 

.X — 1 
'' N 


N-l 

E 

A ^-1 

E- 

N-l 

E- 

2^0 

A ^-1 

E- 

N-l 

E 

2=0 


1 


14- 

i f m 
' N^n 

- 1 )' 


1 


1 + 

Tfi^- 

1 ) 


1 


1 + 

Tfi^- 

1 ) 


1 

1 

1 + 

n(x — 

1 )' 


1 



1 + 1 ^( 2 : — 1 ) 


m _ 

n _ 

N 


N-l 

E 

2=0 




< 


m _ ^ A ^—1 

n \ ^ 


N 




i=0 


To prove inequality (l46ll . it suffices to show that the following inequalities 
hold: 


Iln(x)-^E 


N 


< 


- 1 


i=0 


l + j^(a:-l)' 2P-2-3' N 


N-l 


Sr 

i=0 


1 


j_ j-C™ _ 1V 

^ AT V „ 


(47) 


,x — 1 


N-l 

E 


^ S 1 + ^(3^-!) 


1 - 1 

2 _ 

N 


N-l 


y —^ 

i=0 Af V n 




< 


1 , - 1 


A/-1 

E 


2P-a-3' N + 

2=0 ' N 1 n ' 


(48) 


Proof for inequality (1471) : We use inequality (|4^ and calculate an upper 
bound for the left hand side of inequality (ITTl) : 


I ln(a:) 




(49) 


i X 1) ^ 

To calculate an upper bound for , we consider the function /(a:) = 

and calculate its derivative: 


fix) 


Nx^ 


From inequality (HSl) we conclude that x G The function f{x) 

is decreasing (/'(a;) < 0 ) in the interval [ 5 , 1 ] and increasing {f'{x) > 0 ) 
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in the interval [1,^]- Thus, the maximum of f{x) for x S [^i is 
max(/(i),/(i^)). We use this to rewrite inequality (l4^ : 


I ln(a;) 


X 

N 


N-l 


E 


1 


1 + 1 ^( 2 ^ 


1 *^T77 

^|</(x)<max(/(-),/(—)) 

1 (3rn _ 2^')2 
= max( —) 


Nm) 

1 3777. ">2 

< max(-, ——-) = 

N{^y 


3m 

2Nn 


(50) 


We also calculate a lower bound for the right hand side of inequality (ITTl) : 


- 1 


N-l 

E 


2p-,-3. n + 


> 


1 


1 


2P 


-3 I 


N 


0.-3 1 „ ) 


>- 


2P-3 
1 


2 p 




N-l 

N 

N (f- 1 ) 


iV(l + f-l) 2P-3(f) 
(51) 


To show that inequality (HTl) holds, it suffices to prove the following inequal¬ 
ity (see inequality dsni),®): 


Thus, it suffices to choose N > 2^ 


(f-1) 

2Nn 2P-^{^) 

^ n ' 

-2(^f 


Proof for inequality 

—1 


-^-1 ■ 

: To prove inequality 


we consider the cal¬ 


culations in 


i-n , i h —TT and estimate the amount of precision that 
we lose in the approximation • 

From Theorem l5.4liil we conclude that j units of precision is lost by subtract¬ 
ing 1 from X where j > log 2 (We obtain the following inequalities: 


, / , ,Pn M 1 , m 

IV ! \ ^ 2P-J ' n ' 

I z , , i . TTi .. 1 I ^ / nr ^ I 

h^(a;-l)-^-hv- 

N N n 2P 3 N n 

The approximation j^(^ — 1) of jf{x — 1) is positive. Hence, we do not 
lose precision by adding jj{x — 1) and 1 (see Theorem 15.4li|) . 

I,. r. i\\ nr ^ \ \ i ^ i^ / nr .. 

(I + "^(2^ ~ I)) ~ (I + ■^(-I)) < C)p- J I + Iv’^- 

N N n 2P 3 N n 
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By approximating the inverse of 1 + — 1), we lose 1 unit of precision 

(see Theorem 15.31) : 

, 1 _ 1 , 1 , 1 , 

We lose 2 units of precision by multiplying and i \x-i) Theo¬ 
rem [O]): 


, X — 1 


1 




(f- 1 ) 

N 


1 


1 -b 


< 


1 


2p-j-3' 


N 




Since the numbers are approximated by the positive numbers 

, i —TT, calculating the summation , i ) —tt does not 

affect the precision: 


,x — 1 


N-l 

E 


m _ ^ 




N-l 

E 


2=0 


< 


1 


m _ ^ 


2p-^-3 ' N 


N n 
N-l 


E 

2=0 


1 




ii. Suppose 0 < ^ < 1. We use the following identity to approximate ln(x): 

ln(x) = — ln( —) 

X 

We approximate ^ by {n^m^p— 1) (see Theorem l5.3p . Then, we apply the 
first part of the theorem and Theorem IS.ll to approximate — ln(i). 

□ 


Algorithm [7] applies Theorem 16.31 to approximate ln(x) with arbitrary pre¬ 
cision. Note that when the approximation — is close to 1 the amount of pre¬ 
cision that we lose in the calculations depends on the magnitude of Loss 
of precision for x ~ 1 in Algorithm [3 is due to our approximation formula, 
/ —TT- We divide the interval between 1 and x into N subin- 

tervals and approximate the area under the curve f{t) = The length of the 
interval [l,x] is crucial in our approximation. Thus, recomputing x with higher 
precisions is necessary when a significant amount of precision is lost in x — 1 
(see Line 131221 in Algorithmic. 

To show that approximating ln(x) for x ~ 1 is fundamentally problematic 
we apply perturbation analysis on /(x) = ln(x): 


xfjx) I ^ ifiil 

/(x) 'ln(x) 


^ ln(x) ^ 
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Require: expr has the shape ln(a;) 

1: procedure COMPUTE(ea;pr,p) 

2: Px ^ P + 5 

3: repeat 

4: ^ ^ COMPUTE(a;,Pa;) 

5: if — > 1 then 

n 

6; arg ^ — 

^ n 

7: £ ^ 4 

8: else if 0 < — < 1 then 

n 

9: arg ^ ^ 

10 ; £ ^ 5 

11; else 

12 ; “Undefined operation” 

13; j ^ riog2(^)l 

14; Px — j — £ > P then 

15; N^\2P--^^^'] 

I arg— 1 I 

IR. m' , arq-1 y^iV-l 1 

n' N Z^i=0 i+^(org-l) 

17; if — > 1 then 

n 

18; return ^ 

n 

19; else 

20; return 

n' 

21; else 

22; Px Pa; + 1 

23; until true 
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Figure 3: Approximating arctan(a:) 


When a; ~ 1 the quantity ln(a;) is a small value and hence committing a small 
error in the approximation of x causes a significant error in calculating ln(x). 
Thus, for a: Ri 1, iterative computations are unavoidable. 


6.3 Arctangent 

We first introduce an approximation of arctan(x) using Riemann sums. The 
assumption is that the precise value of x is available. Afterwards, we extend 
our calculations to introduce an approximation of arctan(a;) based on a repre¬ 
sentation {m,n,p) of X. 

Suppose that a: > 0 is a real number and we want to approximate arctan(a:). 
We consider the curve y = see Fig. 3(a) We calculate the area enclosed 


by this curve and the t-axis between t = 0 and t = x: 


dt 


l + t"^ 


— arctan(x) 


We approximate this area using Riemann sums. Fig. |3(b)] shows approximations 
from above and below for the integral using rectangles. From Fig. |3(b)| we can 
derive the following inequalities for N rectangles: 


N N-1 


We want to estimate the precision of our approximations. Suppose we ap¬ 
proximate arctan(a:) by i_|_(_L) 2,^.2 ■ The absolute error of this approxi- 
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mation can be estimated as follows: 


N-l 

E 


N-l 




-arctan(a:)| < 1 ;^ E 


1 


N 


1 


2=0 


= -(l- 




X 1 


1 


r) = 


1 + a;^ -/V(l + a; 2 ) 


(52) 


Up to this point, we have assumed that x is precisely calculated. In what 
follows, we assume that x is approximated by (m, n,p). We extend the Riemann 
sum calculation to compute arctan(a;) using the given approximation of x. 

Theorem 6.4. Let x be a real number represented by {m,n,p). Then arctan(a;) 


can be represented by {m',n',p — 6) where ^ ^ 


=0 i+iwTi^r 


and 


iV = 2 P- 2 [(^) 2 ], 

Proof. We consider two cases: 

1. Suppose — > 0. Since x is represented by (m, n,p) we can write: 


m, 1 , m, 1 . 2 m 

— 1 - < 2^ < — 1 + TT < - 


To prove the theorem, we should show that: 

1 


(53) 


/m\ N—1 

I arctan(x)-^ ^ 


N-l 




2=0 V n >' 2=0 Vn/ 


(54) 


We rewrite the left hand side of inequality (f54|l as follows: 

N-l 


(—) 

I arctan(a:) - 


1 


N ^ 1 -I- (±)2(inY 

i=0 ^ ^ y n ) 


N-l 


I arctan(a;) “ E 
2=0 

N-l 

X V—> 

+ vS 

2=0 

A ^-1 

I arctan(a;) - ^ E 
2=0 
N- 

E 


1 


1 + 

1 (f) V- 

l + (i)2a.2 TV ^ 1 


N-l 


1 


'■N ^ 

1 


2=0 




2 I — 


+ 


x ^ 1 


N-l 

E 


(f) _ 

^ i + (^)^(f)^' 
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Thus, to prove inequality (ISTl) . it suffices to prove the following inequali¬ 
ties: 


I arctan(x) - ^ 


N-l 

X 1 


1 1 


i-y 

' N ^ 


N ^ 1 + (»)22;2' 2P-5' N ^ 1 + fJ.'|2fin'|2 

i=0 ~ \ N J i=o ^ ^ \ N^ V „ 1 

1 (-) 


(55) 


_ (f) y-_L 

IV 1 I r t Y. 




< 


1 I (^) y^_1 

-5 I AT 2-^ ^ \ ( i \ 


N-l 


2P-5 'TV ^ 1 + fi-)2('m\ 
i=0 ^ 1 AT / V n / 


(56) 


Proof for inequality (1551) : We first consider the left hand side of in¬ 
equality (l55|) and calculate an upper bound for it. From inequality 
we can write: 


I arctan(a;) - jy 


N-l 

X 1 


< 


^ ^ l + (l^)2a:2'- lV(l+x2) 


(57) 


To obtain an upper-bound for jv(f+^^ consider the function f{x) = 


N(i+x^) calculate its derivative: 


fix) = .... . o^o > 0 


A^(l-bx2)2 

Thus, f{x) is an increasing function and its maximum occurs when x gets 
its maximum value. Inequality (l53ll implies that ^ is an upper bound 
for X and hence we can rewrite inequality (15711 as follows: 


arctan(x) - nt 


N-l 

X 1 


< 


< 


8(^)3 


^ ^ 1 + Nil + x2) - fV(l + 4(f )2) 

8(^)3 


< TV(1 +'Vf )2) 

We also calculate a lower bound for the right hand side of inequality (jSSll : 

1 fm] 1 1 (ml 1 

op-s I TV 2^ 1 4- (±\2 (m\'^ ' ^2P~^ N ^ 1 I ( N-I- ')2( m\2 

i=0 “’"Vvt I „ 7 i=0 ^ V N ! Inl 


TV 2 


V 2 P- 5 I 1tV 2-h (TV-l)2(^): 

^ T 1 \ ('"^w '1 

>l2P-5^'^n^'^iV2(l + (^)2)^ 

(-) 

''71/ 


r) 


2P-5(i + (^)2) 


(59) 
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To prove inequality (1551) . it suffices to show that the following inequality 
holds (see inequality dSll),®): 


8(^)3 


< 


(-) 


iV(l + (f)2) 2P-5(l + (f)2) 


We divide all the components by obtaining: 


m? . 1 

N ^ 2P-5 

Thus, it suffices to take N = 

Proof for inequality (1551) : To prove the inequality, we consider the 
calculations in % , and estimate the amount of error that 

iV ^t—K) 

we commit in the approximation i-|-(» ) 2 (^rnyi ■ 

From Theorem 15.21 and inequality (15311 . we conclude that 2 units of preci¬ 
sion is lost by approximating by and hence we obtain: 



1 

2 P -2 





< 


1 1 / i n 2/™',2 
-2lUrl 


The approximation of is positive. Thus, we do not lose 

precision by adding and 1 (see Theorem 15.4lil) : 


l(l + (^)V)-(l + (^)^(-)^ 

N N n 


2 P- 2 ' 


We lose 1 unit of precision by approximating the inverse of 1 -I- {jf)‘^x^ 
(see Theorem 15.31) : 


< 


l + (i)2^2 l+(.)2(^)2' 2P-3'l+(i)2(^)2 


We lose 2 units of precision in the approximation of jj- (see 

Theorem 15.21) : 


'N 1 + ij.y 


(^) 1 


< 


2P 


~5 


(f) 1 I 


The approximations ^-^e positive for 0 < i < — 1. Thus, 

we do not lose precision by calculating the summation ^ i-|-( ^ 23,2 
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(see Theorem 15.41111 : 


N-l 

\-y 

' N 


/m\ N — 1 
' n / \ ' 


'-/V ^ 1 + (4.)2a;2 N ^ l + (±)2(rnY 


< 


1 ( 21 ) 

V n / 


iV-1 

E 


2P-5I jV 1 + f±'|2('m'>2 I 

i=o ^ V AT j V „; 


2. Suppose ^ < 0. We can write: 


TO _ 1 , TO 1 , 

< TT*'- 


Observe that a; < 0; we can use the following identity for arctan(x): 
arctan(a:) = — arctan(—x) 


where —x > 0. We first approximate arctan(—x) using the first part of the 
proof. Afterwards, we apply Theorem 15.11 to approximate — arctan(—x). 
Unary negation does not influence the precision. 


□ 


Algorithm 8 Arctangent 


Require: expr has the shape arctan(x) 
1: procedure COMPUTE(expr,p) 

2: ^ ^ COMPUTE(x,P + 6) 

3: A^2P+4[(^)2] 


4: 


(w) Y^Af—1_1_ 

N 2^i=o l + (^)2(^)" 


5: return 

n' 


Algorithm [ 8 ] applies Theorem 16.41 to approximate arctan(x) with arbitrary 
precision. Note that Theorem l6.4l predicts the amount of precision that is lost by 
calculating arctan(x) independently of the argument x and hence Algorithm [5] 
calculates arctan(x) in one pass. 

To confirm that arctan(x) is computable in one pass, we apply perturbation 
analysis and calculate the quantity 1I /(^) = arctan(x): 

, xf'{x) ^ I ^ I_ X _ 

/(x) arctan(x) (1 + x^) arctan(x) 

Proposition |4] (see 1^ shows that I (i+a;:i) rrctan(a:) I ^ iterative computations 
are not required for approximating arctan(x). 
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7 Approximating Transcendental Functions by 
Taylor Expansions 

In this section we first briefly discuss the basics of approximating functions using 
Taylor expansions. Afterwards, we use Taylor expansions to approximate sin(a;) 
and cos (a:). 

Suppose / : H —7> i? is a function and I = {a, b) C D such that: 

• / has n continuous derivatives on I (denoted by for 1 < * < n); 

• /("■+!) exists on /; 

• Xq G I. 

Taylor’s theorem states that for every x G I there is a number Cx between x 
and xo such that f(x) = Pn(x) + Rn{x) where: 

Pn{x) = V --- -{x - XqY , Rn{x) = . , x , - Xo)"+ (60) 

^ *■ (n + 1)! 

The formula Rn{x) is called the Lagrange form of the remainder [20] . 

In the remaining of this section, we discuss approximations for sin(a:) and 
cos(a:). Our approximations are based on the following Taylor expansions around 
the point xq = 0 : 


OO 

sin(a;) = ^ 
2=0 


( ^2»+l 

(2i + l)! 


OO 

cos(a:) = 

2=0 


(-ir 

(2z)! 


X 


2i 


(61) 

(62) 


For each function, we first describe an approximation that is applicable to 
the base interval I = (—1,1) (see Section lTT]) . Afterwards, we extend our calcu¬ 
lations to the complete domain of the functions using range reduction identities 
(see Section [H. Proof of correctness for the base and general cases are pro¬ 
vided. Moreover, the COMPUTE(ea;pr,p) function is extended to approximate 
sin(a;) and cos(a:) with arbitrary precision. We use perturbation analysis to 
confirm the observations obtained from the approximations. 


7.1 Approximating Functions in the Base Interval 

7.1.1 Sine 

In this section we use the Taylor expansion from equality (1611) to approximate 
sin(a;). We assume that the input argument x is represented by {m,n,p) and 
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Theorem 7.1. Let x be a real number represented by {m,n,p) such that —1 < 
^ < 1. Then sin(a:) can be represented by [in',n',p — 2 |’log 2 ( 2 A^ — 1)] — 3) 

where ^ (L+i)! N € N is an odd number such that 

(|)((2JV+22gJ^) > 2p^ 

Proof. The real number x is given with precision p: 


m 

n 


,m, m 1 ,m, m 1 ,m, m ,m, 


(63) 


To prove the theorem, we should show that the following inequality holds: 


I sin(a;) 


E 


(—1)® TO 2i+l 
( 2 i + 1 )!*'^^ 


< 


1 

2p-2nog2(2Ar-l)l-3 


IE 


(-ly TO 2^+1 

( 2 i + l)!^n^ ' 

(64) 


We rewrite the left hand side of inequality 


(EH) as follows: 


7 = 0 ^ ’ 


N 


sin 


sin 


iV 

(2^) - E 


S ' -hiEa;2»+i I V EE_a;2*+i _ V -PP—( — P^\ < 

^{21 + 1)1 ^(2z + l)! ^(2z + l)!^n^ 

^=0 7=0 7=0 

W-EEa;2»+i| I I V EE_a;2*+i _ V ,TO,2*+1, 

^(2z + l)! I 'P^(2^ + l)! 


7=0 


( 2 z + l)!^n^ 

7=0 

Thus, we prove inequality (l64ll by showing that the following inequalities hold: 

I • / \ (~1)* 2i+l| ^ I (~1)* ,TO,2i+l 

|sin(a:) ( 2 i + l)!^ I < 2 P- 2 riog 2 ( 2 A(-i)l -2 I E ( 2 ^+ 1 )!^^ 

7=0 ^ ^ 7=0 ^ ^ 


(65) 


w _(zE_^2i+i _ (-1)' 

^( 2 * + l)! ^( 2 * + l)!^n^ 


< 


1 ^ (- 1 )* 2 *+i 

2 P- 2 riog,( 2 Ar-i)l -2 ( 2 i+ 1 )! I 

Proof for inequality (1651) : Based on Taylor’s theorem (see equality (|60)) 1 and 
inequality (l63ll . we can rewrite the left hand side of inequality (l65ll as follows: 

^(2i + l)! ' ' (2V + 2)! (2)V + 2)! ' ' 


sin( 
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We choose = 2fc + 1 > 1. Applying Proposition [S] (see|3 we can rewrite the 
right hand side of inequality (l 6 ^ as follows: 


N 


1 _ (-ly 

-1)1-2 I lo.- I 1 M V „ 1 


2 p- 2 riog 2 ( 2 Ar-l)l -2 ' ( 2 ^ + 1 )! ' n 

i=0 ' ^ 


sgnif) 


2p-2riog2(2Ar-i)l-2 

i=0 


{2i + iy.n 


( 68 ) 


where sgn is the sign function. To prove inequality ([65]), it suffices to show that 
the following inequality holds (see (l67l) . (|68)) 1 : 


1 |2Af+222Af+2 


sgni^J _ (-1)' .m 2»+i 

/-l)]-2 2-^ /"o.- I 1 M V „ 1 


( 2 Af+ 2 )! 

This is equivalent to: 


< 


2 P- 2 nog 2 ( 2 Ar-i)l -2 ( 2 ^ + 1 )! ^ n 

i=0 ' ' 


sgn(f) 


N 


- 1 ) 1-2 


-fji 2i+l |m |2Af+222Af+2 


( 3 ) 


2 P- 2 riog 2 ( 2 Ar-i)l -2 (2i + 1 )!' n 

2 = 0 ^ ^ 


^gn(f) 


2P-2riog2(2Ar-i)l- 


dlF.(v-(vKv)’ + i: 


N 


{2N + 2)l 

(—1)® TO 2i+l 


3! n 


^ ( 2 i + 1 )! n 

I m |2A^+222A^+2 

( 2 Af+ 2 )! 


( 3 ) 

> 0 


(69) 


The quantity sgn{^)'^f ^2 (L+i)! positive (see Proposition 15] in lAT) . 

Moreover, the approximation ^ satisfies |^| < 1. To prove inequality (15^ it is 
sufficient to show: 


2 p- 2 riog 2 ( 2 Ar-i)]- 2 ''„ ’ {2N + 2.)\ 


sgn{^){2N - l)\m 
n 


I m |322A^+2 

(--(:^)(-)^)- '"L >0 


'3!' n 


( 2 Af+ 2 )! 


2 P -2 

Inequality (iTOl) is equivalent to: 

.TO,,TO,, /Ix/’Tl,, 2^+^^ /m,2N 

~ (2A^ + 2)!(2A^- 1)2^^^ ^ ^ ® 

Since sgn(^)(^) > 0 and (^)^ < 1, we should choose an N such that: 

^ 2P+2N-2\log2{2N-l)] 


(70) 


6 (2Af+ 2)!(2Af-1)2 
Thus, N should satisfy (|) ( 2 ^^+ 2 )K 2 Af-i) ^ 2 P_ 


< 1 
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Proof for inequality (1551) : We show that given an approximation ^ of x with 

precision p we can approximate ( 27 + 1 )! ^^'"^^ by 

precision p — 2 |’log 2 ( 2 A^ — 1 )] — 2 . 

The sign of the terms alternates between positive and negative. 

Thus, adding two arbitrary terms with different signs from the expansion can 
significantly reduce the precision (see Theorem I5.4lii|) . To avoid this, we first 
consider specific pairs of terms for which loss of precision due to addition is 
bounded. Then, we calculate the summation of these pairs. The following 
identity shows the way we calculate ( 27 +i)! ^^^~*~^- 


E 


( ^2i+i ^ 

( 2 z + l)!'^ 


k 


E 


^- (1 


(4i + 2)(4i + 3) 


) 


Choosing = 2fc + 1 allows us to make pairs of terms. 

The number x is given with precision p. Thus, we can approximate x^ with 
precision p — 2 (see Theorem 15.2p : 



1 

2P-2 



and hence 


i-f 

v n, / 


< 


1 


(m)2 

V n / 


'(4i + 2)(4i + 3) (4t + 2)(4i + 3)' 2^-2 ' (4i + 2)(4i + 3) 


2 ( — 

We approximate 1 - ( 4 ^_^ 2 K 4 i+ 3 ) by 1 - ( 4 i+ 2 )( 4 i+ 3 ) ■ boss of precision in the 

(m)2 

I ' ri ■' _ 

approximation can be estimated by calculating the quantity log 2 (— 

( 4 i + 2 )( 4 i+ 3 ) 

(see Theorem 15.4liiP : 


log2(- 


1 + 


1 - 


(4i+2)(4i+3) 

Thus, we lose at most 1 unit of precision; 

V n / 


(wf , , 1 „ 

(4i+2)(4i+3) , , , 6 \ 1 

' ) < l0g2(Y—f) = l0g2(-) < 1 


(w) 


1 ( 1 - 


X 


/m \2 
V n / 


(4z + 2)(4i+ 3) 


)-(l- 


< 


(4i + 2)(4i + 3)" 2P 


-3 


II- 


(4z + 2)(4z + 3)' 


The powers are approximated by (^)‘**'*'^ for 0 < * < fc and 2 |’log 2 ( 4 i + l)] 
units of precision is lost in this operation. Thus, in the worst case, the precision 
is reduced by 2 |'log 2 ( 4 fc + 1 )] units: 

I 4i+l _ ^ _|/'!Zl'\4i+l| _ __|('!Z('l4i+l I 

' ^n’ I 2P-2riog2(4fc+l)l I 2P-2riog2(2Af-l)l I 
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Multiplying approximations of by a constant does not influence the pre¬ 

cision: 


^4^+l 

(4f-h 1)! 


m \4z+l 


(-) 

V n ^ 


m \4i+l 


< 


(-) 

V n ^ 


(4t -I-1)! ' 2P-2nog2(2Ar-l)l ' (4j + 1)1 ' 


The multiplication be approximated by multiplying 

/rn.\4i+l ( — )^ 

the approximations and 1 — ( 4 ^ 4 . 2 )( 4 z+ 3 ) Theorem 15.2^ : 


r.4i+l 


f(1- 


/ m \4i+l 

:)-77^(1- 


(m)2 

\ n ^ 


'(4t-hl)!' (4t-h2)(4f-t3)' (4t-hl)!' (4t 

/ m \42+1 
I V n / 


< 


( 1 - 


2){M + Z)'' 

/ m \2 
\ n ^ 


1 

2p-max{2riog2(2Af-l)l,3}-2 I ( 4 ^ _j_ 1 )| 

-I /-mpi-l-l 

_liili__ 

2P-2riog2(2Ar-l)l-2 I (4^ + 4)1 V (4- 2)(4t -t 3) 


{Ai + 2){Ai- 

fm\2 

t n ^ \ I 


■3)' 


For the last equality, we use the assumptions (|) ( 22 w^^^igi'( 2 jv - oi ) > 2^ and 
= 2fc -I- 1; we can conclude that TV > 3. 

For 0 < i < k the terms ( 4 /+ 1 )! “ ( 4 i-s 2 )( 4 i-i- 3 ) ) have the same sign which is 

determined by the sign of ^ (see Proposition[S]in|^. Thus, we can approximate 

Eto (3^(1 - ( 4 l+^p+ 3 )) by Eto WiTr(l " without losing 

precision (see Theorem 15. 4li|) : 


lE 

2=0 


„42+1 


(4f + l)! 


( 1 - 


(4f-h 2)(4t-h 3) 

1 


^ /m\42+l 

-Efe^d- 


/ m \2 
t n 2 


2=0 


(4f-hl)! 


2p-2riog2(2Ar-l)] 




(4f-h 2)(4f-t 3) 

/ m \2 
n ' 


)l< 


i =0 


(4f-Hl)!' {Ai + 2){Ai + 2,)' 


□ 


Theorem 17.11 provides a top-down approximation for sin(a;) in the base in¬ 
terval. Loss of precision in this approximation is estimated independently of 
the argument x. To show that iterative calculations can be avoided in the base 
interval, we calculate for f{x) = sin(a:): 


xf'jx) 

/d) 


X ■ cos(x) 
sin(a;) 


\x ■ cot(x)| 


(71) 


Proposition [7] (see El shows that |a; • cot(a:)| < 1 for a; G (—1,1). Thus, we can 
approximate sin(a;) in the base interval in one pass. 
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7.1.2 Cosine 

In this section we introduce an approximation for cos(a;) where x is represented 
by {m,n,p) and |^| < 1. Our approximation is based on the Taylor expansion 
from equality (15^ . 

Theorem 7.2. Let x be a real number represented by {m,n,p) such that —1 < 
^ < 1. Then cos(a;) can be represented by (m', n',p — 2 |’log 2 ( 2 fV —2)] —3) where 

^ ^( 20 ! N gN is an odd number such that > 

2P_ 


Proof. The real number x is represented by {m,n,p) and hence we can write: 

(72) 

IL IL It IL 

To prove the theorem, we need to show: 


m ,m, m 1 ,m, m 1 ,m, m 


( —1^* 777 2i 1 (—iV TH 

|cOs(a:)-^-^^( —) I < 2p-2riog2(2Ar-2)l-3l5Z"(^(“) I 
2=0 ^ ’ 2=0 ^ ’ 

We rewrite the left hand side of inequality (f73|l as follows: 


(73) 


N 


COS 


W-Sw*-)”!" 

(2i)\ n 


2=0 

N 


COS 




2=0 

N 


COS 


M - J: 


2=0 

N 


(2z)! 


2=0 

N 




(2z)! n ' 

- 1 )^ 

( 2 i)! ' n ' 


^ (20! ' ' (20! 

i=0 ^ ' i=0 ' ' .— 

We prove inequality (17^ by showing that the following inequalities are valid: 

i_ly (74) 

,(2iV-2)l-2lZ^ l'q„;U 1 I 0^1 


N 


COS 


f ^ (“!)* 2i| ^ 

(x) - Z^ -7::-rrX | < 


i=0 

N 


( 2 *)! 


2P-2[iog^(2N-2)^-2< ^ (20! n' 
i=0 ’ 


2 i _ 1 _W (- 1 )^ .m 2 i 

( 20 ! ^ ( 20 ! ^ 2 P- 2 ri°g 2 ( 2 Af- 2 )l -2 ' Z^ ( 20 ! ^ n^ ’ 

Proof for inequality ([7^ : We use Taylor’s theorem (see equality (1601) ) and 
the bounds calculated for x in inequality (f72)l to rewrite the left hand side of 
inequality (f74|l : 


N 


COS 


(X) - y: 


Cos(2^+l)(c,)x2^+l |^|2Ar+l22W+l 


i=0 


(2z)! 


(21V + 1)! 


< 


(21V + 1)! 


(76) 
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We choose iV = 2fc + 1 > 1 and apply Proposition [5] (see El to rewrite the right 
hand side of inequality (17^ : 

_ I _ 1 ^ (- 1 )^ m ^_ 1 _A {-ly m 2z 

2 p- 2 \log 2 { 2 N- 2 )]- 2 \ 2^ (2i)\ ^ n> I 2 P- 2 riog 2 ( 2 W- 2 )l -2 ^ 

(77) 

To show that inequality (TTH) holds, it suffices to prove the following (see in¬ 
equality (1751) .(1771)1: 


I m |2Af-|-l22Af-|-l 


N 


< 


(2iV-hl)! 2P-2ri°S2(2iv-2)l-2 ^ (2*)! 


- 2 ) 1-2 


Inequality (175)) is equivalent to: 


N 


(-l)ffiTO 2 * \IR\2N+l22N+l 


E zzZ('_r - 

(9.iV n > 


2P-2\\og^{2N-2)^-2 {2i)\^n- 

i=0 ' ' 


(2IV-h 1)! 


1 


N 


Zm,, . v^(-l)%m 2* |^|2iv+i22Af+i 


2 P- 2 riog 2 ( 2 Ar 




{2N + l)l 


> 0 


From Proposition [5] (see 1X1) we conclude that the quantity J2f=2 72 ^ 
positive. Since |^| < 1, it suffices to show: 


1 

2 p- 2 riog 2 ( 2 Ar- 2)]-2 




m |2iV+l22Ar+l 
n I _ 

( 2 A^+ 1 )! 


(2jV-2)^ Im, 

2 P -2 '' 2^71^^ ^ 


22N+1 

(2iV-h 1)! 



Inequality (179)) is equivalent to: 


1 m ,2 2 P+^^ 1 ,w ,2 

2 ''n^ ( 2 iV-hl)!( 2 iV- 2 ) 2 '^n'^ 


(79) 


Since ( —)2 < 1 , it is sufficient to choose an N that satisfies the following 
inequality: 


2P+2N—1 

2 (27V-h 1)!(27V- 2)2 ^ ^ 

Thus, we should choose an = 2fc -|-1 that satisfies ( 2 ^+i)K 2 ^- 2 ) ^ 2 p. 
Proof for inequality (ITSl) : To prove the inequality, we estimate the amount of 
precision that is lost when we approximate 

The sign of the terms alternates between positive and negative. Thus, 
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adding two arbitrary terms with different signs from the expansion can poten¬ 
tially cause significant loss of precision (see Theorem 15.dliil) . To avoid this, we 
hrst consider pairs of terms for which addition can be calculated with a bounded 
loss of precision. Afterwards, we calculate the summation of these pairs. The 
following identity shows our computation scheme: 


E 


(- 1 )^ 

( 2 *)! 





{4i + l){Ai + 2) 


Choosing = 2fc -I- 1 allows us to pair the terms of the summation. 

Since x is given with precision p, we can approximate with precision p—2 
(see Theorem 15.211 : 


- C-n < ^\-\ 

n 2P ^ n 


Multiplying the approximation of by a constant does not influence the pre¬ 
cision: 


/m \2 
^ n ' 


< 


i-Y 

^ n ^ 


' (42 -h 1)(42 -h 2) (42 -h 1)(42 -h 2)' 2?^-2 ' (42 -h l)(4i -h 2) 


2 ( — 

We approximate 1 - ( 4 -,^ by 1 - ( 4 ^^ 4 )( 4 ^^ 2 ) ■ estimate lose of pre- 

( m)2 

l“l" ^ n ' 

cision in our approximation, we calculate the quantity log 2 (—(see 

(4i + l')(4i+2) 

Theorem I5.4lii|) : 


log 2 (- 


.l-f 
1 - 


(w)' 


) < 1082(^31 


1 -b i 

^ ) = log 2 3 < 2 


{4t+l){4t+2) 

Thus, we lose at most 2 units of precision: 

rr.2 (1R)2 

V n / 


1 ( 1 - 


(4z-bl)(4i-b2) 


)-(l- 


{M + l)[M + 2)” 2P 


)i< Aiii- 


fm\2 
^ n ■' 


(4i + l)(4i + 2)' 


Approximating by (^)'^* reduces the precision by 2 |’log 2 ( 4 z)] units (see 
Lemma [T] in lA)) : in the worst case we lose 2 [log 2 ( 4 fc)] units of precision: 


I 4i _ /^)4ii , __h — i _ If — 'I'^d 

' I ^ 2P-2ri°g2(4fc)l ' 2P-2n°g2(2Af-2)l I 

Multiplying (^)^* by a constant factor does not influence the precision of the 
calculation: 




(4z)! 


'm\4i 
< n ' I 

(4z)! ' 


< 


1 

2p-2riog2(2Ar-2)] 


' rn\4i 
' n 2 I 

(4z)! ' 
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4i 2 ( m \ 4.i 

We approximate ^(1 - ( 4 ,+i')( 4 ,+ 2 ) ) multiplying the approximations 

/ m >,2 

(1 “ ( 4 »+ih 4 »+ 2 ) ) Theorem [O]): 


v.42 


m \4i 


i(i- 


(-) 
V n / 


/m \2 
V n / 


(4t)!' (4t + l)(4i + 2)' 


-( 1 - 


< 


1 


(4t)! ' (4t + l)(4i + 2)" 

(m\4i / m \2 

I V n / /I V n / 


2 p-max{ 4 , 2 |'log 2 ( 2 Ar- 2)]}-2 I 

1 / m \42 


( 1 - 


(4t + l)(4z + 2) 

/ ?71 \2 

^ n ^ M 


)l 


■2P-2riog2(2Ar-2)l-2l (4^)] ^ (4t + l)(4t + 2) 


To obtain the last equality, we use the assumptions > 2^ and 

N = 2k + 1; we conclude that N > 3. 

/m\4i / m \2 

The terms ( 4 /)! (1~ ( 4 i+i 5 ( 4 z+ 2 ) ) positive for 0 < z < A: (see PropositionjB] 
in El) and hence we can approximate the summation ~ 

without losing precision (see Theorem l5.4lill : 


k 




2=0 


(42 + l)(4i + 2) 
1 


fe (7n\4i 

)-Ey;r(i- 


(m\2 
V n / 


2=0 


(4z)! 


2p-2riog2(2Ar-2)l 


k /m\42 


2=0 


(4z)! 


(4i + l)(4t+ 2) 

/m \2 

_ ^ n ' _ 

(4i + l)(4i + 2) 


< 


)l 


□ 

Theorem O estimates loss of precision in cos{x) in the base interval in¬ 
dependently of the argument x. To show that iterative computations can be 
avoided in the base interval, we calculate for f{x) = cos{x): 

KtV = -^=a;-tanOr 80 

f{x) cos{x) 

From Proposition[5]we conclude that |a;-tan(a;)| < tan(l) for x G (—1,1). Thus, 
we can approximate cos(a:) in the base interval in one pass. 

7.2 Extending Base Interval Approximations 

In Section 17.1.11 and 17.1.21 we discussed approximations for sin(a;) and cos(a;) 
in the base interval. In what follows, we show that range reduction identities 
can be used to extend these approximations to calculate sine and cosine for an 
argument x represented by {m,n,p) where |^| > 1 . 

In our calculations, we use a representation {m',n',p) of tt. This representa¬ 
tion can be obtained based on our approximation for arctan(x) (see Section 1^31) 
and the following identity: 


TT = 4arctan(l) 
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7.2.1 Sine 


Theorem 7.3. Let x he a real number represented by (m, n,p) such that | ^ | > 1 
and {m',n',p) be a representation for tt. Suppose ^ ^ where k G 

and 0 < ^ < ^. The value of sm{x) can be approximated as follows: 

i. //O < ^ < 1, then sin(a;) can be represented by {mi,ni,p — ii) where: 


i\ =si + ti 

-^1 /• 1 — 
m ^ { 2 i + 1)! n 

i—O ^ ' 


ii. //1 < ■= < 2j then siii(x) can be represented by (m 2 , n 2 ,p — 12 ) where: 
^2 H- ^2 H“ 2 
m 2 


=2f y -izlilL(Il) 

^2 (2z + l)!4n^ 


(~1)^ , TO 

2^ (2z)! ^2n’ 
i=0 ^ ' 


Hi. 7/2 < = < then sin(x) can be represented by — is) where: 

is =si + S2 + is + 6 


ms 

ns 


-HE 


(-ly 

h 4zz^ 


z=0 

A^3 


(2i + l)!*'4zz^ 


^ (—1)* ,m' Z7Z 

/ ^ (<yj I A—' 


z=0 


(2i+l)!Mn' m' 


N 4 / - \ J f _ 27 

E (—Ij'^m m ^ 

( 2 i)\ ~ W 

7=0 ^ ' 


In the approximations above, si, S 2 , Ni, N 2 , N 3 , N 4 G N are the smallest natural 
numbers satisfying: 


1 + 


Si > log2 


max(|^|,fc^) ^ 1 / 

7-) , S2 > l0g2( 


1 + 


1 - 


maxd^l.fcZ!!;) 


1 - 




,5,,(2iVi+2)!(2iVi-l)% ^ (2iV2 + l)!(2iV2 - 2)2 ^ 

*' 6 '''' 22^1 ^ ^ ’ 22^2 ^ ^ 

5 (2jV3 + 2)!(2jV3-l)^ . ^ (2jV4 + l)!(2iV4 - 2)2 

2‘sn3 j ^ ^ > 22 N 4 ^ ^ 


and ^ 1,^2 G N are defined as follows: 


h =2riog2(2iVi-l)l +3 

t2 =max(2[log2(2Afi - 1)1 + 3, 2riog2(2iV2 - 2)] + 3) 
ts =max(2|'log2(2A^i - 1)] + 3, 2|'log2(2iV2 - 2)] + 3, 
2riog2(21V3 - 1)1 + 3, 2riog2(2iV4 - 2)1 + 3) 
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Proof. Since |^| > 1, we can choose k G Z, such that = = ^ and 

0 < = < ^. Suppose y = X + kn. We use the following identity to calculate 
sin(a::): 


sin(x) = sin(a; + kn) = sin(j/) 


(81) 


We approximate y hy = = ^ + k^ (see Theorem 15.4lii|) . By performing this 
approximation, we lose si £ N units of precision where si is the smallest number 
satisfying: 


1 + 

Si > l0g2(- 

1 - 


max(| — \ ,k^^) , 

_ ^I ’ n' ■' \ 

min(| ^ 
max{| ^ 


We consider three cases for calculating sin(j/): 

1. Suppose 0 < ^ <1. We apply Theorem 17.11 Thus, ti = 2|’log2(2A^i — 
1)] + 3 units of precision is lost in the approximation of sin(y). A total of 
Si + ti units of precision is lost in the approximation of sin(a:). 

2. Suppose 1 < = < 2. We use the following identity to bring the argument 
within the base interval: 


sm{y) = 2sin(|)cos(|) 


(82) 


We approximate f by Since i < ^ < 1, we apply Theorem 17.11 and 
[U to approximate sin(|) and cos(^), respectively. Loss of precision in 
these calculations is as follows: 


t2 = max(2|'log2(2Afi - 1)] + 3, 2|'log2(2Af2 - 2)] + 3) 


The multiplication in equality (15^ reduces the precision by 2 units (see 
Theorem [S3). A total of Si + t 2 + 2 units of precision is lost in the 
approximation of sin(a;). 

3. Suppose 2 < = < ^. We use the following identity to bring the argument 
within the base interval. 


sm{y) = 2sin(|)cos(|) 

= 4sin(|)cos(|)cos(|) 

= 4sin(-)cos(-)sm(- - 

= 8sin(^) cos(^) sin(^ — 
^4 4 4 


^) 

2 ’ 

y\ 


(83) 
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We approximate | and ^ | by ^ and ^ — |=, respectively. We lose 

S 2 S N units of precision in this approximation (see Theorem l5.4liil) where 
S 2 is the smallest number satisfying: 


1 + 

S2 > l0g2(- 

1 - 


' 4n' ’ 4n 


We apply Theorem 17.II and 17.21 to approximate sin(|), cos(|), sin(^ — |), 
and cos(j —|). Lose of precision in these approximations can be calculated 
as follows: 


ts = max(2|'log2(2W - 1)] + 3, 2|'log2(2A^2 - 2)] + 3, 
2riog2(2iV3 - 1)1 + 3, 2riog2(21V4 - 2)1 + 3) 

The three multiplications in sin(|) cos(|) sin(^ — |) cos(^ — |) reduce the 
precision by 6 units (see Theorem l5.2l) . A total of si + S 2 + ts + 6 units of 
precision is lost in the approximation of sin(a;). 


□ 

Algorithm [5] applies Theorem 17.11 and 17.31 to approximate sin(a:) with ar¬ 
bitrary precision. In this algorithm, initially, we calculate x with a precision 
that is adequate for calculations in the base interval. However, if the obtained 
approximation is outside the base interval, we use the half-angle formula or add 
rational multiples of tt to the argument (see equality (ICTI) and (IMll l. 

Observe that an arbitrary amount of precision can be lost in the approxi¬ 
mation of a; -I- A:7r and f — f when x ~ —kn. Hence, iterative computations 
might be necessary in the approximation (see Line I4I22I29I38I in Algorithmic]). 
To show that this is essential for sine, we reconsider the perturbation analysis 
in equality m for f(x) = sin(a:). The quantity jcc • cot(a:)| can be arbitrary 
large for x ~ —kir. Thus, iterative computations are essential for approximating 
sin(x). 

7.2.2 Cosine 

Theorem 7.4. Let x be a real number represented by (m, n,p) such that | ^ | > 1 
and {m',n',p) be a representation for tt. Suppose = = ^ -|- {2k + 1)^ where 
fc € Z and 0 < ^ < ^. The value o/cos(x) can be approximated as follows: 

i. If 0 < ^ < 1, then cos(x) can be represented by {mi,ni,p — ii) where: 


ii =si + ti 


mi 

ni 


Ni 


2=0 


_(-^ m 
(21-f-l)!^n'^ 


47 










Algorithm 9 Sine 


Require: expr has the shape sin x 
1: procedure COMPUTE(ea;pr,p) 

2 : Choose an odd N such that > 2 P+ 2 n°g 2 ( 2 Af-i)l +3 

3: Px P + 2 [log 2 ( 2 A — 1)] + 3 

4: repeat 

5: ^ ^ COMPUTE(a:,Pa;) 

6 : if — 1 < — < 1 then > Theorem l7.ll 


7 

8 
9 

10 

11 

12 

13: 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 




(~1)^ /m^22+l 
' n, / 


mo _ _ 

no /L^i —0 (22+1)! '' n 

return ^ 


else 


no 


^ ^ COMPUTE(4arctan(l),p^) 
Choose k G such that 0 < ^ + 

m . _ Tn _j_ j TYi' 

rT n ' n ^ 


Choose Si £ N such that Si > log 2 (- 


< 


min(| — I ,fc ) 

IH-—- 

max( I — I, k ) 

mind “ I + ) 


max(| ^ I, k ) 

Choose an odd Ni such that 
Choose an odd N 2 such that ( 2 iV 2 + 0 ^C^ 2 - 2 ) ^ 
if 0 < f < 1 then o Theorem OUl 

ti i — 2[log2(2Ai — 1)] + 3 
if Pa; — Si — ti > p then 

mi _ v^A^l ( — 1) 

Z^i =0 


m ^1=0 (2i+l)! 

return 2li 

m 

else 

Pa: t- Pa; + 1 

else if 1 < = < 2 then 

— n 


/ m\ 2 *+l 
^ n / 


t2 ^ max(2|'log2(2Ai — 1)] 
if Pa; — Si — ^2 — 2 > p then 

^Ni (-!)• /mN2i+l 


> Theorem 17.3liil 
3,2riog2(2A2-2)l +3) 


m 2 ( — 1 ) / m ^ ^A '2 ( — 1 ) f m 

112 “ 2 ^ 1=0 ( 2 i+l)! V 2 n'' j I Z^i =0 ( 2 i)!'' 2 ri:’ / 

return — 

112 

else 

Px ^ Pa: + 1 

else [> Theorem 17.3liiil 

Choose an odd A 3 such that 

Choose an odd N 4 such that > 2 p»-«i -«2 

ts ^ max(2[log2(2Ai - 1)] + 3, 2|'log2(2A2 - 2)] + 3, 
2riog2(2A3-l)l +3,2[log2(2A4-2)l +3) 
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until true 


_ o \ '-‘'-t y / rn x-- ' ^ w \ y — -^f / rr, 

ns ~ ( 22 + 1 )! Un^ )\2^i=0 ( 22 )! Ur 

{ ^N 3 (-1)* ( m' (-!)•/ 

i Z^i=0 (2i+l)! Un' in! ) i Z^i=0 (2i)! ' 

return — 

ns 

3e 

Ps Ps + 1 


*• 2/1 < = < 2, i/ien cos(x) can 6e represented by {m 2 ,n 2 ,p — 12 ) where: 

*2 =Si + t2 + 2 

*• If “2: < ^ < +, ^/len cos(x) can 6e represented by {mz,n^,p — * 3 ) where: 

*3 =S1 + S2 + ^3 + 6 

5 -^<-'>‘(S<^< 5 n-(Sw<so 

. / (-1)' .w' _ ^ i-^y 

l (2z + 1)! Mn' 4n / ' I Z^ ('27:il ATT/ I 


(2iy. +n' 4n' 


In the approximations above, 31 , 52 + 1 + 2 + 3+4 G N are the smallest natural 
numbers satisfying: 

mindful,(2fc+l)^) 

^1 1 max(|f^|,( 2 fe+l)^) 

Si > l0g2 - 1 ’ > l0g2 - . 

max(|^|,(2fe+l)4^) 


,5,, (2+ + 2)!(2+-l)\ 

22A^i ^ ^ ^ 

5 (2jV3 + 2)!(2+-+ 

22 N 3 ! ^ ^ 

and ti,t 2 G N are defined as follows: 


(2+ + 1)!(2+ - 2)2 
2 '2N2 

(2+ + 1)!(2+ - 2)2 
^2lv^ 


ti =2riog2(2+-l)l +3 

t 2 =max(2[log2(2+- 1)1 + 3, 2riog2(2iV2 - 2)] +3) 
ts =max(2[log2(2+ - 1)] + 3, 2|'log2(2iV2 - 2)] + 3, 
2riog2(27V3-l)l +3,2riog2(2+-2)l +3) 
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Proof. Since |^| >1, we can choose fc S Z such that = = ^ + {2k + 1)^ and 
0 < = < Suppose 2 / = a: + {2k + l)f • We use the following identity to 
calculate cos(x): 


cos(a;) = (—1)^ sin(a; + {2k + 1)^) = (—1)^ sm{y) 


We approximate y hy ^ ^ + {2k + 1)|^- From Theorem I5.4lii[ we can 

conclude that si G N units of precision will be lost in this approximation where 
Si is the smallest natural number satisfying: 


1 + 

Si > log2(- 

1 - 


mindful,(2fc+l)j^) 
max(|^|,(2fc+l) 

™in(l vl.(2fc+l)^) 

max(|-^|,(2fc+l)2^1) 


We apply Theorem l7.3l to approximate sin(y) and determine the loss of precision 
in the approximation. From Theorem IS.ll we can conclude that the factor (—1)^ 
does not influence the precision of the approximation. □ 

Algorithm [in] applies Theorem 17.21 and 17.41 to approximate cos(a;) with a 
desired precision. The algorithm first calculates x with a precision that is suf¬ 
ficient for approximating cos(a:) in the base interval. Similar to sin(a;), we use 
range reduction identities to calculate cos(a:) for an arbitrary x. 

Observe that a significant amount of precision might be lost in the approx¬ 
imation when X ~ Thus, iterative computations might be necessary 

in our approximation (see Line 141221291351 in Algorithm ITUl) . To show that these 
recomputations are essential, we reconsider the perturbation analysis of equal¬ 
ity (1801) . The quantity \x ■ tan(x)| can be arbitrary large for x « 
hence iterative computations are unavoidable for cos(a;). 


8 Related Work 

An implementation for a bottom-up approach to exact real arithmetic is pro¬ 
posed in m- For a given expression, the inputs are computed with predefined 
precisions and a bottom-up scheme is used to determine the guaranteed preci¬ 
sion of the output. Iterative computations are required if the obtained precision 
is not adequate. A formalization of a top-down approach in a theorem prover is 
proposed in m The author first provides a definition for a metric space based 
on a ball relation. Afterwards, real numbers are defined as the completion of 
the metric space Q. Rational operations are lifted to approximate operations 
on real numbers. This approach is optimized in [10) . 

Two closely-related top-down approaches based on absolute errors have been 
studied in mm- These approaches mainly differ in their approximations of 
the transcendental functions. In [3] the authors introduce a general way for 
calculating with Taylor expansions and apply this method to approximate the 
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Algorithm 10 Cosine 


Require: expr has the shape cos x 
procedure COMPUTE(e3;pr,p) 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13: 


Choose an odd N such that > 2 P+^r*° 82 ( 2 ^ 2 ) 1+2 

Px P + 2[log2(2A — 2)] + 2 

repeat 

^ ^ COMPUTE(a:,Pa:) 
if -1 < ™ < 1 then 


[> Theorem 17.21 


_ 

~ 2^2=0 


(-i)‘ 


mo _ _ 

no /-^i—0 ( 22 )! 

return ^ 


( rn\‘^ 

^ n / 


else 


no 


^ ^ COMPUTE(4arctan(l),p2:) 

Choose k £ Z such that 0 < ^ + {2k + 1) 

f + (2A: + 1)^ 


2n' 


< 


1 + 


min(| ^ |,( 2 fc 4 -l )7 


r) 


Choose Si G N such that Si > log 2 (- 


■n“(l ^ l.(2fc + l) 

“in(| ^ |,(2fe + l);g4-) 


■n“(l I'(2fe + 1) 


14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 


Choose an odd Ni such that 
Choose an odd N 2 such that ^ 

if 0 < f < 1 then o Theorem iTTlIl 

ti i — 2[log2(2Ai — 1)] + 3 
if Px — Si — ti > p then 

mi _ /_ (—1)^ 

ni ^ ^ 2-^i—O ( 22 + 1 )! ' n ^ 

return ^ 

m 

else 


Px-^Px + ^ 

else if 1 < = < 2 then > Theorem 17. 4liil 

t 2 £- max(2riog2(2Ai - 1)] + 3, 2riog2(2A2 - 2)] + 3) 
if Pa; — Si — ^2 — 2 > p then 

m 2 _ n I 1^fc/'v^JVl ( — 1)* 1 m w ( —1)* 1 m',2®\ 

na “ A ^ \2^i=0 (2i+l)\y2ny ) { l^i=0 {2iy. ^ 2n > J 

return — 

na 

else 


Px-^Px + ^ 

else [> Theorem 17.4liiil 

Choose an odd A 3 such that ( 5 )( ( 2 iV 3 + 2 A 2 Ar 3 -i)^ ) ^ 

Choose an odd A 4 such that > 2p-‘^^-‘^^ 

h £- max(2[log2(2Ai - 1)1 + 3, 2riog2(2A2 - 2)1 + 3, 
2riog2(2A3-l)l +3,2riog2(2A4-2)l +3) 
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Algorithm 10 Cosine (Continued) 


35: 


34: 


if px — Si — S 2 — > p then 



36 

37 

38 


return ^ 


else 


Px Px + I 


39: until true 


transcendental functions. In m, the approximations of the transcendental func¬ 
tions are treated separately and in a more ad-hoc way. In contrast, our approach 
is based on relative errors. We provide detailed proofs of correctness for each 
operation and use perturbation analysis to identify essential recomputations. 

In [1] the authors propose a layered framework for computations with real 
numbers. The lowest layer is an implementation of floating point arithmetic. 
In the second layer, arithmetic operations are approximated using polynomial 
models. The highest layer supports more advanced features such as differential 
operations. Proof of correctness in Coq and an implementation based on [1] 
are also available. In this article, we have focused on approximating specific 
operations, whereas in [T] polynomial models are discussed in an abstract way 
without concrete examples from well-known arithmetic operations. 

The approach introduced in |18) is based on linear fractional transformations 
(LFTs). Computations are encoded as trees of LFTs; various operations are 
defined to extract the result of a computation from the corresponding tree. 
However, this approach does not specify a top-down scheme to relate the desired 
precision in the output and the required operations on the expression tree. 
Expression trees are evaluated using lazy evaluation; computations terminate 
when adequate information is available in the root of a tree. 

A symbolic approach to exact real arithmetic has been proposed in [8] . The 
author uses infinite binary sequences in the golden ratio base to represent real 
numbers. To calculate an expression, first the symbolic techniques available in 
Maple are applied to obtain a simplified expression. Additional Maple proce¬ 
dures are implemented by the author to extract binary sequences from simpli¬ 
fied expressions. Performing operations on binary sequences is also possible. 
However, choosing a suitable balance between symbolic computations and di¬ 
rect manipulation of binary sequences depends on the given expression. As 
indicated in [8], using this approach to its full potentials requires expertise in 
Maple. Moreover, the procedure might need adaptations for each problem. 

9 Conclusion 

In this article, we proposed a simple representation for real numbers and dis¬ 
cussed a top-down approach for approximating various arithmetic operations 
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with arbitrary precision. The focus was on: 

• providing complete algorithms and proofs of correctness for the approxi¬ 
mations, and 

• perturbation analysis to identify essential iterative computations. 

Existing exact real arithmetic approaches have explored different represen¬ 
tations for real numbers; approximations for algebraic operations and transcen¬ 
dental functions have also been proposed based on these representations. As 
far as we can see, proofs of correctness for existing approaches are restricted 
to basic operations. Moreover, no formal reasoning is provided to prove the 
necessity of iterative computations. 

We envisage various extensions of the presented approach. From a practical 
point of view, some optimizations are essential. For example, the coefficients 
m and n in the representation {m,n,p) can grow rapidly during computations. 
Thus, space efficiency is a relevant concern. One can consider an alternative 
representation in which large coefficients are represented in a more efficient 
way. Moreover, the computational efficiency of the transcendental functions can 
be improved by reducing the amount of required computations (i.e, number of 
rectangles in Riemann sums, number of terms in Taylor expansions) to guarantee 
the desired precision. 

As discussed in Section [531 in certain computational problems, computing 
the expressions as they are would lead to loss of precision, whereas rewriting 
the expressions would allow us to compute them in one pass. Our top-down 
approach can be extended with a set of rewrite rules that transform problematic 
expressions into expressions that can be calculated in one pass. 

Acknowledgement This research was supported by the Dutch national pro¬ 
gram COMMIT and carried out as part of the Allegio project. 


References 

[1] P. Collins, M. Niqui, and N. Revol. A validated real function calculus. 
Mathematics in Computer Science, 5(4):437-467, 2011. 

[2] A. Feldstein and P. Turner. Overflow, underflow, and severe loss of signifi¬ 
cance in floating-point addition and subtraction. IMA journal of numerical 
analysis, 6(2):241-251, 1986. 

[3] P. Gowland and D. Lester. The correctness of an implementation of exact 
arithmetic. In Proceedings of RNC 2000, volume 140, 2000. 

[4] P. Hertling. Computable real functions: Type 1 computability versus type 
2 computability. In Proceedings of CCA 1996. Mathematik/lnformatik, 
Universitat Trier, 1996. 

[5] N. Higham. Accuracy and stability of numerical algorithms. SIAM, 2002. 


53 


[6] N. Higham. Functions of matrices: theory and computation. SIAM, 2008. 

[7] A. Hohmann and P. Deuflhard. Numerical Analysis in Modern Scientific 
Computinq: An Introduction, volume 43. Springer Science & Business Me¬ 
dia, 2012. 

[8] T. Kelsey. Exact numerical computation via symbolic computation. In 
Proceedings of CCA 2000, pages 187-197. Springer, 2000. 

[9] D. Kincaid and E. Cheney. Numerical analysis: mathematics of scientific 
computing, volume 2. American Mathematical Soc., 2002. 

[10] R. Krebbers and B. Spitters. Type classes for efficient exact real arithmetic 
in Coq. Logical Methods in Computer Science, 9(l:l):l-27, 2013. 

[11] G. Kreisel, D. Lacombe, and J. Shoenfield. Partial recursive functionals 
and effective operations. Constructivity in Mathematics, Studies in Logic 
and the Foundations of Mathematics, pages 290-297, 2000. 

[12] B. Kushner and L. Lefman. Lectures on constructive mathematical analysis, 
volume 60. American Mathematical Soc., 1984. 

[13] T. M. library, http://www.mpfr.org. Visited: August 2015. 

[14] A. Markov. On the continuity of constructive functions (in Russian). Us- 
pekht Mat. Nauk (NS), 9:226-230, 1954. 

[15] V. Menissier-Morain. Arbitrary precision real arithmetic: design and al¬ 
gorithms. The Journal of Logic and Algebraic Programming, 64(l):13-39, 
2005. 

[16] N. Muller. The iRRAM: Exact arithmetic in C-I--I-. In Proceedings of CCA 
2001, pages 222-252. Springer, 2001. 

[17] R. O’Connor. Certified exact transcendental real number computation in 
Coq. In Theorem Proving in Higher Order Logics, pages 246-261. Springer, 
2008. 

[18] P. Potts. Exact real arithmetic using Mobius transformations. PhD thesis, 
PhD-thesis, Imperial College London, 1998. 

[19] M. Richter and K. Wong. Computable preference and utility. Journal of 
Mathematical Economics, 32(3):339-354, 1999. 

[20] M. Spivak. Calculus. Benjamin, 1967. 

[21] K. Weihrauch. Computable analysis: an introduction. Springer Science & 
Business Media, 2012. 


54 



A Useful Propositions &: Lemmas 


Proposition 1. For any p G N the following inequalities hold: 

(l + i_)2 < 1 + ^ 

1- ^ < {l-—f 

2p—2 — ' 2p ' 


Proof. For inequality (1841) we can write: 

1 


(1 + —= 1 + — H- 

Similarly for inequality (1851) we have: 


1 1 1 
^ 1 F ^—t — 1 F 


2P 


-1 


2P 


-1 


2P 


-2 


(1 - —= 1 F - 

'' 2^p' 22p 




2P 


-1 


2P 


-2 


Proposition 2. For any p G N"*" the following inequalities hold: 

2P ^ ^ 1 

2p _ 1 - 2P-1 

I 1 ^ 2P 

2P-1 - 2P + 1 

Proof Since p G N"*" we have 2 p > 2. For inequality (IMl) we have: 

2P 2P -I+ l 1 1 

-=-= 1 F- < 1 F-r 

2P - 1 2P - 1 2P - 1 “ 2P-1 

Similarly for inequality (|87ll we can write: 

2P _ 2P + 1 - 1 _ ^ 1 ^ ^ 1 

2P + 1 “ 2P + 1 “ ~ 2P + 1 “ ~ 2P-1 


(84) 

(85) 


□ 


( 86 ) 

(87) 


□ 


Proposition 3. For any 0 < y < 1 the following inequality holds: 

log2(lFy)<^^ (88) 

Proof. Based on Taylor’s theorem, we can write the following expansion for 
log2(l Fy): 


N 

log 2 (l Fy) = 

i=l 




F 


(-l)AfyJV + l 


iln(2) (iV 


l)(l + c,)^+iln(2) 
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where 0 < Cy < y. For iV = 1 we obtain: 


log2(l +j/) = 


ln(2) 2(l + ci)2 1n(2) 

2 

The quantity 2 (i+ef)^ in( 2 ) positive and hence log 2 (l + y) < 


□ 


Proposition 4. For xeR\{0} we have |I < 1- 

Proof. We analyze the derivative of fix) = | I = (i+.^) rrctan(.) 

and find the intervals in which f{x) is increasing/decreasing. 

,, . arctan(a;)(l — — a; 

^ (1 + x^)^(arctan(x))^ 

We analyze the derivative in the following cases: 

1. Suppose 0 < X < 1. We use the Taylor expansion of arctan(x) to rewrite 
arctan(x)(l — x^) — x: 


(— 1 )* 

arctan(x)(l — x^) — x = ^ ^ — x — x^ arctan(x) 


i=0 

OO 


“ ^ 2 ' ~ arctan(x) 


OO 


= E '(i—T ■ 4^) ■ arctan(x) 


The term —x^ arctan(x) is negative and —x"** ^~ifpi) is negative for 
i > I and 0 < x < 1. Thus, the summation ~ IFfl) ~ 

x^ arctan(x) is negative and /'(x) < 0. The function /(x) is decreasing 
for 0 < X < 1. 


2. Suppose X > 1. In this case, arctan(x)(l — x^) — x < 0. Thus, /'(x) < 0 
and fix) is decreasing for x > 1. 

3. Suppose X < 0. Since /'(—x) = —fix), we can conclude from the first 
two cases that /(x) is increasing when x < 0. 

The case analysis shows that /(x) is bounded from above by lima;_>o fix) = 1. 

□ 


Proposition 5. For any y € (—1, 0) U (0,1) and z G N the following inequalities 
hold: 


(4t + 1)! 
(4t+ 1)! 


,,4i+l 


,,4i+l 


(_ 1 ) 2*+1 

(4t + 3)! 
(_ 1 ) 2*+1 

(4t + 3)! 


y4.+3 

y4.+3 


> 0 
< 0 


ify>0 


ify<0 
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Proof. We can rewrite the summation as follows: 


(- 1 )" ^ (- 1 )" . 

(4t + l)!^ (4t + 3)!^ (4i + l)!^ ^ (4t + 2)(4t + 3) ^ 

2 , 

Since 1 — ( 4 i_|_ 2 ^( 4 i+ 3 ) ^ 1 ~ g > 0 the sign of the summation is determined by 
the sign of 

□ 


Proposition 6. For any y £ (—1,0) U (0,1) and i G N the following inequality 
holds: 


(4t)! 


,Ai 


(_ 1 ) 2*+1 
(4t + 2)! 


,,4i+2 


> 0 


Proof. We can rewrite the left hand side of the inequality as follows: 


(4z)! 


,,4i 


+ 


(_ 1 ) 2 .+ 1 

(4t + 2)! 


y4^+2 ^ 


(4z)! 




(4t + l)(4t + 2) 


y^) 


Since 1 — ( 4 i_|_i)^( 4 i_|_ 2 ) 1/^ > 1 — 5 > 0, the inequality holds. 


□ 


Proposition 7. For x £ (—1,1) \ {0} we have \x ■ cot(x)| < 1. 

Proof. For the interval x G (—1,1) \ {0}, we have \x ■ cot(x)| = x ■ cot(x). We 
analyze the derivative of f{x) = x ■ cot(a:) in (—1,1) \ {0}. 


f'{x) = cot(a;) — x{\ + cot^(a;)) 


sin(x) cos(a;) — x 
sin^(x) 


sin(2a;) — 2x 
2sin^(a;) 


The point a; = 0 is a critical point for f{x). Since |sin(2x)| < |2a;|, f{x) is 
increasing in (—1, 0) and decreasing in (0,1). Thus, f{x) is bounded from above 
by lim 2 ,^o /(a:) = 1. 

□ 


Proposition 8. For x £ (—1,1) we have \x ■ tan(a;)| < tan(l). 

Proof. For the interval x £ (—1,1), we have |x-tan(x)| = x-tan(a;). To determine 
an upper-bound for x ■ tan(a;), we analyze the derivative of f{x) = x ■ tan(x): 


f'{x) = ian{x) + x(\ -f tan^(a;)) 


sin(x) cos(a;) -I- x 
cos^(x) 


sin(2x) -f 2a; 
2cos^(a;) 


The quantity sin(2a;)-|-2a; is negative in (—1,0) and positive in (0,1). Thus, f{x) 
is decreasing in (—1,0) and increasing in (0,1). We conclude that f{x) < tan(l). 

□ 


Lemma 1. Let x be a real number represented by {m,n,p). Then a;* can be 
represented by {m'‘,n',p— 2 [log 2 ]). 
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Proof. We can apply Theorem l5.2l to calculate x® = x Let k = [log^] 
and P{i) denote the precision that we lose by calculating x®. From Theorem l5.2l 
we can write: 

p(*) < p(r^i) + 2 < p(r^i)+2fc = 2fc 

Thus, we lose 2k = 2|'log2] units of precision by calculating x®. 

□ 
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